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MULTI-MODEL  ALGORITHMS 
FOR 

OPTIMIZATION 


by 

RICHARD  GORDON  CARTER 

ABSTRACT 

A  recent  approach  for  the  construction  of  nonlinear  optimization  software  has 
been  to  allow  an  algorithm  to  choose  between  two  possible  models  to  the  objective 
function  at  each  iteration.  The  model  switching  algorithm  NL2SOL  of  Dennis,  Gay 
and  Welsch  and  the  hybrid  algorithms  of  Al-Baali  and  Fletcher  have  proven  highly 
effective  in  practice.  Although  not  explicitly  formulated  as  multi-model  methods, 
many  other  algorithms  implicitly  perform  a  model  switch  under  certain  circumstances 
(e.g.,  resetting  a  secant  model  to  the  exact  value  of  the  Hessian). 

We  present  a  trust  region  formulation  for  multi-model  methods  which  allows  the 
efficient  incorporation  of  an  arbitrary  number  of  models.  Global  convergence  can  be 
shown  for  three  classes  of  algorithms  under  different  assumptions  on  the  models. 
First,  essentially  any  multi-model  algorithm  is  globally  convergent  if  each  of  the 
models  is  sufficiently  well  behaved.  Second,  algorithms  based  on  the  central  feature 


of  the  NL2SOL  switching  system  are  globally  convergent  if  one  model  is  well 
behaved  and  each  other  model  obeys  a  “sufficient  predicted  decrease”  condition.  No 
requirement  is  made  that  these  alternate  models  be  quadratic.  Third,  algorithms  of 
the  second  type  which  directly  enforce  the  “sufficient  predicted  decrease”  condition 
are  globally  convergent  if  a  single  model  is  sufficiently  well  behaved. 
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CHAPTER  1 


General  Introduction  :  Multi-model  Algorithms  for  Optimization 

1.1.  Introduction 

In  this  thesis,  we  are  concerned  with  a  class  of  algorithms  for  the  solution  of 
problems  of  the  form: 

minimize /(x)  (1.1) 

where  /  :  IR"  ->  IR1  is  twice  continuously  differentiable  with  gradient  g(x)  and 
Hessian  H(x).  We  refer  to  this  as  the  unconstrained  minimization  problem. 
Historically,  many  different  solution  algorithms  have  been  created  for  this  problem, 
most  of  which  are  based  on  different  ways  of  creating  a  local  model  of  the  objective 
function.  At  each  iteration,  some  scheme  produces  a  single  model  for  the  objective 
function  and  uses  this  to  determine  the  next  approximation  to  the  function  minimizer. 
Consider  Newton’s  method  for  solving  (1.1). 
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Algorithm  A.  1.1  :  Newton’s  Method 

0)  Initialize. 

1)  Repeat  until  convergence  : 

Set  xk+i  =  xk  -  H(xkTlg(xk)  . 

2)  End. 


For  a  convex1  function  f  ,  this  is  equivalent  to  the  following  algorithm. 


Algorithm  A.  1.2  :  Newton’s  Method 

0)  Initialize. 

1)  Repeat  until  convergence  2  : 

Set xk+i  =  argmin{f(xk)+g(xky  (x  -xk  )  +  ~(x  -xk  )‘ H (xk)(x-xk  ) }. 

2)  End  . 


The  function  /  is  modeled  at  each  iteration  by  the  Newton  model,  which  is  the  first 
three  terms  of  a  Taylor  expansion.  New  iterates  are  generated  by  computing  the 
minimizer  of  this  model. 

Since  most  minimization  algorithms  have  been  based  on  quadratic  local  models, 
the  process  of  “producing  a  model”  has  essentially  been  that  of  determining  an 
approximation  to  the  Hessian  of  the  objective  function.  If  the  exact  Hessian  was 

1  A  function  is  said  to  be  convex  if  for  every  pair  of  vectors  j.jeR"  and  every  [0,1]  it  follows 
that  f(^+(\-i)y)<t/(x)+ (K )/  tv  )• 

2  The  notation  x‘  denotes  the  transpose  of  the  vector  * .  The  notation  argmin(  h  (x  )  }  denotes  the  value  of 
*  which  minimizes  the  function  h(x)  provided  such  a  minimizer  exists  and  is  unique. 
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available,  the  Newton  model  could  be  used.  If  it  was  not,  or  if  it  was  too  expensive 
to  calculate,  then  a  finite  difference  Hessian  approximation  could  be  used.  If  this 
seemed  to  be  too  expensive,  one  could  turn  to  one  of  the  secant  methods,  such  as  the 
well  known  BFGS  and  DFP  updates.3  If  the  problem  had  special  structure,  other 
models  could  be  used  to  take  advantage  of  this. 

An  alternative  approach  has  begun  appearing  in  the  literature  over  the  last  few 
years,  notably  in  the  algorithm  NL2SOL  of  Dennis,  Gay,  Welsch  [1981  a,b]  and  the 
works  of  Nazareth  [1980,  1983],  Al-Baali,  Fletcher  [1985],  and  Dennis,  Sheng,  Vu 
[1985].  In  this  approach,  an  algorithm  considers  more  than  one  possible  model  at  a 
given  stage  of  the  iteration.  Nazareth,  Fletcher,  and  Al-Baali  call  such  algorithms 
hybrid  methods,  but  we  refer  to  them  as  multi-model  algorithms.  In  simplest  form, 
algorithms  of  this  type  are  structured  as  follows. 


Algorithm  A.  1.3  :  A  Multi-Model  Algorithm 

0)  Initialize. 

1)  Repeat  until  convergence  : 

2)  Choose  a  model  from  two  or  more  possibilities. 

3)  Compute  an  approximate  minimizer  for  this  model. 

4)  Either  accept  this  minimizer  as  xk+l  and  proceed,  or 

return  to  2)  and  try  another  model. 

5)  End  . 


3  See,  for  example,  Broyden  [1970],  Davidon  [1959]  and  Fletcher,  Powell  [1963]. 
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The  multi-model  algorithms  discussed  above  are  for  a  special  case  of  prob¬ 
lem  (1.1),  the  nonlinear  least  squares  problem  : 

minimize  f  (x  )  =  — -  II  r  (jc  )  1 1 

2  2  (1.2) 

r  :  JR"  ->  IRm  . 

These  algorithms  are  among  the  best  available  methods  for  solving  (1.2),  yet  little  is 
known  concerning  their  theoretical  properties.  Furthermore,  the  literature  focuses  to 
some  extent  on  specific  models  introduced  rather  than  developing  the  concept  of 
multi-model  algorithms  fully,  and  little  consideration  is  given  to  using  multi-model 
algorithms  for  other  special  cases  or  for  general  unconstrained  optimization.  The  lack 
of  a  theoretical  framework  is  particularly  unacceptable  in  light  of  the  observation  that 
other  methods  in  the  literature  can  be  considered  to  be  multi-model  algorithms,  even 
though  not  formulated  explicitly  as  such.  Typical  of  these  are  the  many  algorithms 
that  “reset”  or  “refresh”  a  Hessian  approximation  under  certain  conditions.  These 
conditions  are  typically  defined  arbitrarily,  or  at  best  determined  heuristically. 

The  purpose  of  this  thesis  is  twofold.  First,  the  concept  of  using  multiple 
models  for  general  unconstrained  minimization  is  dissected  at  length  to  clarify  both 
the  structure  and  goals  of  such  algorithms.  We  classify  existing  methods  using  the 
framework  thus  constructed,  discuss  how  well  they  satisfy  our  goals,  and  point  out 
new  ways  of  utilizing  multiple  models.  Second,  we  develop  a  global  convergence 
theory  for  trust  region  implementations  of  a  large  class  of  multi-model  algorithms. 
This  theory  is  surprisingly  general,  both  in  terms  of  the  small  number  of  restrictions 
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on  allowable  model  switching  logic  and  in  terms  of  the  number  of  options  allowed  in 
the  trust  region  logic.  We  can  even  include  nonquadratic  models  in  a  natural  way. 

Throughout  this  paper,  we  use  the  following  definitions  and  notation.  The 
function  to  be  optimized  is  /  (jc  ),  its  gradient  is  g  ( x ),  and  its  Hessian  is  H  (x ).  The 
residual  vector  of  the  nonlinear  least  squares  problem  is  called  r(x),  and  the  Jacobian 
of  r  is  denoted  J(x).  Depending  on  context,  x*  denotes  a  local  minimizer  of  /  or  a 
point  satisfying  g(x)  =  0.  Our  iterate  at  step  k  is  denoted  xk,  and  a  step  away  from 
xk  is  called  p,  pk,  or  pk.  The  expressions  gk  ,  Hk  ,rk  ,  and  Jk  are  shorthand  for 
g(xk)  ,  H(xk)  ,  r(xk)  ,  and  J(xk)  respectively.  An  approximation  to  H (xk)  is 
denoted  by  Bk  or  Bk.  The  Euclidean  norm  of  a  vector  v  e  1R"  is  denoted  by  II  v  II . 
For  a  matrix  B  e  IRnx",  II  B  II  refers  to  the  operator  norm  induced  by  the  Euclidean 
vector  norm,  while  II  B  II  F  refers  to  the  Frobenius  norm  of  B.4  DGW  [1981]  refers 
to  references  Dennis,  Gay,  Welsch  [1981  a,b]. 

In  actual  implementation,  modifications  must  be  added  to  algorithms  such  as 
A.  1.2  and  A.  1.3  to  make  them  practical.  The  most  significant  changes  are  designed 
to  aid  the  algorithm  if  the  initial  iterate  is  far  from  the  solution.  This  can  be  done  by 
adjusting  the  length  of  the  step  just  computed  (a  line  search  procedure)  or  by 
imposing  an  adjustable  constraint  on  the  model  at  each  iteration  (a  trust  region 
procedure).  A  complete  description  of  these  procedures  must  be  deferred  to  later 
chapters,  but  a  rough  sketch  of  the  structure  of  Algorithm  A.  1.3  with  the  inclusion  of 


4  The  Frobenius  norm  of  B  e  Rnxn  is  defined  by  II  B  II F 
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a  trust  region  strategy  is  as  follows. 


Algorithm  A.  1.4  :  A  Multi-Model  Algorithm  with  Trust  Region  Logic 

0)  Initialize. 

1)  Repeat  until  convergence  : 

2)  Select  a  model  from  two  or  more  possibilities,  and  choose  a  trust  radius 

A. 

3)  Compute  an  approximate  local  minimizer  x  for  this  model  subject  to 

II x-xk  II  < A. 

4)  Either  accept  x  as  xk+1  and  proceed,  or 

return  to  2)  and  try  another  model  and/or  trust  radius. 

5)  End  . 


1.2.  Examples 

In  this  section  we  present  examples  of  situations  where  an  algorithm  using  several 
models  should  be  considered.  These  examples  are  included  primarily  for  motivation 
at  this  point  since  detailed  descriptions  of  implementations  must  be  postponed  until 
more  notation  is  established. 

There  are  several  difficulties  that  a  user  must  face  in  choosing  from  an  available 
set  of  algorithms  to  solve  a  typical  problem  and  that  an  expert  must  face  in  designing 
an  algorithm  for  a  specific  type  of  problem.  Four  of  these  difficulties  can  be 
addressed  by  a  multiple  model  algorithm.  The  first  such  difficulty  is  the  identification 
of  the  type  of  problem.  The  second  is  the  proper  utilization  of  models  which  perform 
inconsistently  as  an  iteration  progresses.  The  third  difficulty  is  the  incorporation  of 
newly  derived  models  with  unknown  or  unproven  properties.  A  fourth  difficulty  is 
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that  of  deciding  between  a  mediocre  model  with  low  overhead  and  a  more  expensive 
high  performance  model.  The  first  three  difficulties  are  illustrated  vividly  by  the 
nonlinear  least  squares  problem. 

Example  1.1  :  Nonlinear  Least  Squares  (NLS) 

When  encountered  in  the  real  world,  an  NLS  problem  falls  into  one  of  three 
categories  :  the  zero-residual  case,  the  small  residual  case,  and  the  large  residual  case. 
In  the  zero  residual  case,  r{ x*)  is  zero.  The  other  two  cases  lack  a  precise  definition, 
but  a  problem  is  considered  to  be  a  small  residual  case  if  r(x*)  is  “small”  or  r  is 
“almost  linear”  at  jc*,  and  considered  to  be  large  residual  otherwise. 

Common  methods  for  solving  NLS  problems  have  been  the  Gauss-Newton 
algorithm,  the  Levenberg-Marquardt  algorithm  (Levenberg  [1944],  Marquardt  [1963]), 
and  any  of  the  standard  secant  algorithms  for  unconstrained  optimization.  Both  the 
Gauss-Newton  and  Levenberg-Marquardt  algorithms  use 


<[>(  x+p  )  =  J  \\ J (x)p +r {x)\\ 2 

(1.3a) 

=  /  (x)  +  p‘J(x)‘r(x)  +  jp‘J(xYJ(x)p 

(1.3b) 

as  the  local  model,  while  a  secant  algorithm  uses 

<K*+P  )  =f(x)  +  ptJ(x)‘r(x)+  -jP‘ B  p  .  (1.4) 

These  models  differ  only  in  how  they  approximate  the  Hessian  of  /.  A  secant 
method  approximates  the  true  Hessian  H(x)  using  some  update  formula,  while  the 
other  two  use  the  so-called  Gauss-Newton  Hessian  :  J(x  )‘J{x). 
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Let  us  assume  for  the  moment  that  the  secant  update  we  are  considering  is  the 
well  known  BFGS  update.  Then  properly  constructed  trust  region  5  algorithms  based 
respectively  on  the  two  models  have  the  properties  shown  in  Table  1. 

Now  consider  the  “difficulties”  mentioned  above.  First  consider  identification 
of  problem  type.  We  see  that  if  a  user  wants  to  efficiently  solve  a  problem  using  a 
single  model  algorithm,  it  is  very  important  to  know  whether  or  not  it  is  a  zero 


TABLE  1 

RESULTS  OF  USING  DIFFERENT 
MODELS  IN  OTHERWISE  IDENTICAL  ALGORITHMS 


Algorithm  based  on  Gauss-Newton  Hessian 


Converges  in  one  step  on  linear  problems. 

Q-quadratically  convergent  on  zero  residual  problems. 
Rapidly  Q-linear  on  small  residual  problems. 

Very  slow  linear  convergence  on  large  residual  problems. 
Globally  convergent  under  mild  assumptions  on  r . 


Algorithm  based  on  BFGS  Hessian 


Locally  Q- super  linearly  convergent  for  any  residual  size. 
More  robust  than  an  algorithm  based  on  GN  Hessian. 
Global  convergence  an  open  question. 

Does  not  take  advantage  of  problem  structure. 


5  To  get  the  “global  convergence”  cited  in  Table  1  when  using  a  trust  region  algorithm,  we  need 
only  assume  an  upper  bound  on  the  norm  of  the  Gauss-Newton  Hessian.  Line  search  algorithms  re¬ 
quire  more  restrictive  assumptions,  such  as  a  full  rank  Jacobian. 
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residual  case.  This  information  is  often  not  available  in  advance.  Second,  there  is 
the  problem  of  inconsistent  performance.  Algorithms  based  on  the  Gauss-Newton 
Hessian  often  outperform  secant  methods  for  several  iterations  even  in  relatively  large 
residual  cases.6  Third,  the  structure  of  the  NLS  problem  lends  itself  to  the 
construction  of  alternative  models.  Most  of  these  alternatives  are  generated  by 
different  approximations  to  the  second  order  term  S  ( x ),  where 

SCO  =  H(x)-J(x)‘J( x) 

=Xr(COVJC2r(0O  (L5) 

(=i 

At  least  twenty  different  ways  for  doing  this  exist  in  the  literature.  Some,  such  as 
DGW  [1981],  approximate  S  explicitly,  while  others,  such  as  Nazareth  [1980,  1983] 
or  Al-Baali,  Fletcher  [1985]  use  the  structure  to  treat  S  as  an  implicit  part  of  Bk 
without  explicitly  carrying  along  an  approximation  to  it.  Further  complications  are 
introduced  by  some  authors,  such  as  schemes  for  “sizing”  (DGW  [1981])  or 
“scaling”  (Oren,  Luenburger  [1974])  S  to  allow  it  to  adapt  to  local  conditions  more 
rapidly.  Furthermore,  new  models  continue  to  appear  in  the  literature.  The 
theoretical  properties  of  most  of  these  models  are  not  known.  Even  if  analysis  of 
some  of  these  models  is  possible,  the  existing  schemes  differ  from  one  another 
sufficiently  that  it  is  unlikely  that  this  analysis  would  apply  to  the  whole  group. 

6  One  potential  explanation  of  this  behavior  involves  the  number  of  iterations  required  for  a 
secant  method  to  build  up  second  derivative  information.  Additionally,  the  second  order  information 
inserted  into  Bk  by  the  secant  method  may,  in  some  cases,  become  quickly  outdated  due  to  rapid 
changes  to  H  over  the  course  of  several  long  initial  steps  or  in  the  course  of  rapid  convergence  to  a 
zero  residual  solution.  The  Gauss-Newton  Hessian,  although  incomplete,  bases  itself  only  on  informa¬ 
tion  at  the  current  point. 
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Despite  this  lack  of  analysis,  many  of  these  schemes  perform  well  and  have  been 
used  successfully  for  years.  However,  the  idea  of  trying  out  a  code  for  years  to  see 
whether  it  stands  the  test  of  time  seems  inferior  to  proving  that  it  will  work  when 
originally  coded.  We  show  in  this  thesis  that  a  multi-model  algorithm  can  be 
designed  using  one  model  with  known  properties  and  another  model  having  totally 
unknown  properties  in  such  a  way  that  global  convergence  is  retained.  This  removes 
some  of  the  urgency  about  having  a  full  analysis  of  a  new  model  before  widespread 
use. 

Lest  the  reader  decide  that  multi-model  algorithms  are  only  attractive  for 
nonlinear  least  squares  problems,  let  us  consider  examples  of  multi-model  algorithms 
for  solving  general  unconstrained  problems. 

Example  1.2  :  Constant  Matrix  for  Model  Hessian  versus  the  Newton  Model 

The  evaluation  of  an  exact  Hessian  Hk  is  frequently  prohibitively  expensive. 
Using  a  fixed  approximation  B  does  away  with  this  expense  and  substantially  reduces 
linear  algebra  overhead,  but  may  converge  unacceptably  slowly.  The  following 
multi-model  algorithm  is  a  compromise  between  the  two  extremes. 
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Algorithm  A.  1.5  :  Newton  Model  versus  Constant  model 

0)  Initialize. 

1)  Repeat  until  convergence  : 

2)  Set  Bk=B  . 

3)  Set  jc*+1  to  an  approximate  local  minimizer  of  the  model. 

4)  If  xk+l  is  unsatisfactory,  then 

If  Bk  =B ,  then  set  Bk  =H (xk)  and  go  to  3)  , 

Else  reduce  the  trust  radius  and  go  to  3). 

5)  Adjust  the  trust  radius  and  proceed. 

6)  End  . 


Many  other  compromises  are  possible,  of  course.  This  particular  version  always 
tries  the  inexpensive  model  first,  but  quickly  abandons  it  whenever  faced  with  a  trust 
radius  reduction.  An  alternative  would  be  to  abandon  it  only  if  several  such 
reductions  are  necessary,  or  to  use  another  test  entirely.  Yet  another  alternative  is  to 
replace  “5”  in  the  algorithm  by  “Bk_ j”  so  that  rather  than  initially  trying  a  fixed 
default  model  at  each  iteration,  the  algorithm  initially  tries  the  most  recently 
computed  exact  Hessian.  This  algorithm  and  most  of  its  variations  are  shown  to  be 
globally  convergent  in  Chapter  4. 

Example  1.3  :  Secant  Models  and  Full  Newton  Models 

A  more  useful  example  is  to  apply  a  multi-model  algorithm  such  as  A.  1.5  to  a 
secant  model  and  a  full  Newton  model.  Algorithms  are  already  in  existence  which 
normally  use  a  secant  method  because  of  its  low  expense,  but  which  “refresh”  this 
model  to  the  exact  Hessian  occasionally.  The  theory  in  Chapters  5  and  6  show  how 
this  may  be  done  in  a  globally  convergent  manner. 
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1.3.  Synopsis  In  the  remainder  of  this  thesis,  we  are  concerned  primarily  with 
establishing  global  convergence  for  as  wide  a  class  of  algorithms  as  possible. 
Chapter  2  presents  preliminary  definitions  and  reviews  single  model  trust  region 
methods  and  results.  Chapter  3  presents  in  more  detail  the  goals  one  should  keep  in 
mind  when  designing  a  multi-model  algorithm  and  gives  examples  of  algorithms 
which  fail  to  meet  these  goals.  Chapter  4  presents  a  general  trust  region  algorithm 
for  model  switching  and  establishes  a  form  of  global  convergence  based  on  the 
assumption  that  all  the  models  are  sufficiently  well  behaved.  Specifically,  we  extend 
the  global  convergence  theory  of  Schultz,  Schnabel,  and  Byrd  [1985]  to  allow  the 
inclusion  of  any  finite  number  of  alternative  quadratic  models  with  uniformly 
bounded  curvature.  Chapter  5  presents  a  form  of  the  algorithm  which  is  slightly 
more  restrictive  in  the  choices  allowed  for  model  switching,  but  which  allows  looser 
conditions  on  the  behavior  of  the  models.  The  central  feature  of  this  class  of 
algorithms  is  the  test  used  in  NL2SOL  to  distinguish  between  models.  In  Chapter  6 
we  present  methods  of  exploiting  these  looser  requirements  to  produce  algorithms 
which  are  globally  convergent  if  at  least  one  of  the  models  is  sufficiently  well 
behaved.  The  global  convergence  theory  introduced  in  Chapter  5  and  developed  fully 
in  Chapter  6  is  therefore  a  significant  and  comprehensive  extension  of  standard  trust 
region  theory  to  the  multi-model  case.  Chapters  5  and  6  each  present  a  slight 
variation  of  NL2SOL  which  is  globally  convergent  under  different  assumptions  on  the 
models.  Chapter  7  presents  an  illuminating  example  concerning  local  convergence 
rates  of  multi-model  algorithms  and  discusses  future  work  to  be  done  in  this  area. 


CHAPTER  2 


Trust  Regions  Methods  for  Single  Model  Algorithms 

2.1.  Introduction 

The  two  major  strategies  used  for  improving  the  global  behavior  of  optimization 
algorithms  are  line  search  methods  and  trust  region  methods.  In  this  thesis,  we  will 
restrict  ourselves  to  trust  region  methods.  Although  line  search  methods  generally 
require  less  code  complexity,  trust  region  methods  seem  more  appropriate  for  a 
multi-model  context.  Theory  developed  for  trust  region  methods  can  often  be  applied 
directly  to  line  search  methods.  Trust  region  theory  requires  slightly  less  restrictive 
conditions  on  the  models,  and  trust  region  methods  extend  more  naturally  than  line 
search  methods  to  models  that  are  not  quadratics  with  positive  definite  Hessians. 

In  the  following  sections,  we  present  briefly  the  background  necessary  for  the 
rest  of  this  thesis.  The  first  section  deals  with  elementary  definitions  and  some 
properties  of  functions  in  IRW  that  will  be  used  in  later  chapters.  The  remaining 
sections  review  current  trust  region  theory  and  practice  for  single  model  algorithms. 

Trust  region  algorithms,  particularly  when  applied  to  multiple  models,  involve  a 
fairly  extensive  amount  of  notation.  The  reader  is  encouraged  to  make  liberal  use  of 
Appendix  A,  which  includes  a  brief  glossary  of  the  most  important  abbreviations. 
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acronyms,  symbols,  and  parameters  used  in  this  thesis. 

2.2.  Preliminaries 

2.2.1.  Models 

Consider  the  objective  function  /  :  IR"  -» IR1  and  let  xk  be  a  vector  in  IR" .  A 
model  of  /  with  index  i  is  denoted  by 

<t>/'(jc)  :  0/  — >  IR1  ,  (2.1a) 

where  Qk  is  a  subset  of  IR"  containing  xk .  We  also  define  an  associated  function 

predlk{p )  =  /(**)-  <| )k(xk+p )  (2.1b) 

and  call  this  the  predicted  reduction.  Notice  that  predk  is  expressed  in  local 
coordinates  about  the  point  xk .  We  also  define  the  actual  function  reduction  as 

aredk (p)  =  f(xk)  -  f  (xk+p )  ,  (2. lc) 

so  that  predk  is  intended  to  approximate  aredk.  For  convenience,  we  will  usually 
refer  to  predk  as  the  model  rather  than  §k. 

For  the  rest  of  this  chapter,  we  will  assume  that  a  single  quadratic  model  is 
computed  at  each  iteration  k  and  drop  the  superscripts  from  §k  and  predk.  For 
xk  €  IR"  and  a  symmetric  matrix  Bk ,  a  quadratic  model  in  standard  form  refers  to 

$k (x )  =  / (** ) + 8  (xk ( x ~xk)  +  \  (x ~xk Bk  (* ~xk )  (2.2a) 


or  the  associated  function 
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predk(p)=-g‘kp  -^-p‘Bkp  .  (2.2b) 

The  matrix  Bk  is  said  to  be  the  model  Hessian. 

Even  a  partial  treatment  of  the  models  commonly  used  in  optimization  is  beyond 
the  scope  of  this  thesis.  Although  not  completely  necessary,  a  basic  familiarity  on  the 
part  of  the  reader  with  such  topics  as  finite  difference  approximations  of  gradients 
and  Hessians,  secant  methods,  and  elementary  numerical  linear  algebra  is  strongly 
recommended.  A  concise  yet  thorough  introduction  to  these  subjects  can  be  found  in 
Dennis  and  Schnabel  [1983]. 

2.2.2.  Elementary  Notation 

We  make  the  following  definitions. 

Definition  2.1  :  The  sequence  {xk  }  in  IRn  is  said  to  converge  if  there  exists  x*  such 

that  lim  \\xk-x*  II  =  0  .  Furthermore,  this  is  denoted  by  xk  — >x  . 
k  — 

Definition  2.2  :  The  sequence  {xk }  in  IRn  is  said  to  be  first  order  stationary  point 
convergent  if  gk  — >0.  Henceforth  this  will  be  denoted  by  FOSPC  for  brevity. 
FOSPC  will  also  refer  to  an  algorithm  that  generates  sequences  of  iterates 
{xk  }  that  are  FOSPC. 

Definition  2.3  :  The  sequence  {xk }  in  IR”  is  said  to  be  weak  first  order  stationary 
point  convergent  if  lim  inf  II  gk  II  =  0  .  This  property  will  also  be  denoted  by 


the  abbreviation  WFOSPC. 
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Notice  that  [x*  }  is  WFOSPC  if  and  only  if  some  subsequence  of  gk  ->  0. 

Definition  2.4  :  A  set  Q  e  IR”  is  said  to  be  convex  if  for  every  pair  of  vectors 
x  ,y  cfl  and  every  [0,1]  it  follows  that  t;x  +(l-£)y  e  Q  . 

Definition  2.5  :  A  sequence  [xk }  is  said  to  have  the  descent  property  with  respect 
to  /  if  f  (x*+1)  <  fix k)  for  all  k  >  1. 

Definition  2.6  :  Given  a  point  x{e  IR",  the  set  Lif  pcx)  =  {  x  :f(x)  <f(xx)  }  is 
called  the  level  set  of  /  at  xx.  Furthermore,  the  closed  convex  set 

L(fjl)  =  {x:x=^y+(l-£,)z,  ^  e  [0,1],  y,z  e  } 

is  said  to  be  the  convex  hull  of  L(f  jcx). 

Definition  2.7  :  Given  a  convex  set  £2clR"  ,  a  function  /  iQ-^IR1  is  said  to  be 
convex  if  for  every  pair  of  vectors  x  ,y  e  Q  and  every  £,e  [0,1]  it  follows  that 
/(^x+(l-^)y)<^/(x)+(K)/(y). 

2.2.3.  Some  Multivariant  Calculus  Background 

In  future  sections  we  will  make  use  of  the  following  definitions  and  results.  Lemmas 

2.1  through  2.5  are  adapted  from  Lemmas  4.1.2,  4.1.5,  and  4.1.9  of  Dennis  and 

Schnabel  [1983]. 

Definition  2.8  :  For  a  given  x  e  JR71  and  nonzero  p  s  IR" ,  the  directional  derivative 
of  /  at  x  in  the  direction  p  is 


4^-00=  lim  — (/  (x  +ep  )-/(*)) 
dp  e-40  e 
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if  the  limit  exists.  The  second  directional  derivative  of  /  at  x  in  the 
direction  p  is 

CO  =  lim  —  +ep  )-^f-(x)) 

e->0  e  dp  dp 

if  the  limit  exists. 

Lemma  2.1  :  If  the  function  f :  IR"  — >  IR1  is  continuously  differentiable  with  gradient 
g  in  an  open  set  containing  the  point  x ,  then  for  any  nonzero  p  e  IR” ,  the 
directional  derivative  of  f  at  x  in  the  direction  p  exists  and  has  the  value 
g(x)‘p. 

Lemma  2.2  :  If  the  function  f :  IR”  -» IR1  is  twice  continuously  differentiable  with 
Hessian  H  in  an  open  set  containing  the  point  x,  then  for  any  nonzero 
p  €  IR" ,  the  second  directional  derivative  of  f  at  x  in  the  direction  p  exists 
and  has  the  value  pl H{x)p . 

We  will  often  use  the  second  directional  derivative  to  characterize  the  shape  of  a 
function. 

Definition  2.9  :  For  a  nonzero  vector  p  e  IR" ,  ^  tLS-)P-  js  sai<j  to  be  the  curvature 

P‘P 

of  /  in  the  direction  p . 


aV 

3  2p 


Definition  2.10  :  Any  vector  p  e  IR"  that  satisfies  p‘ H(x)p  <0  is  said  to  be  a 
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direction  of  negative  curvature  for  the  function  /  at  the  point  x . 

Definition  2.11  :  A  symmetric  matrix  B  e  lRwxn  is  said  to  be  positive  definite  if 
x‘Bx>  0  for  every  nonzero  teR",  positive  semidefinite  if  x‘Bx  >0  for 
every  x  s  IR",  indefinite  if  x‘Bx  >0  for  some  x  e  IR"  and  yl  By  <0  for  some 
y  g  IR",  and  negative  definite  if  x‘Bx  <0  for  every  nonzero  x  e  IR". 

Definition  2.12  :  A  point  x*  e  IR"  is  said  to  be  a  saddle  point  of  the  function  /  if 
g(jc*)  =  0  and  H(x*)  is  indefinite. 

Lemma  2.3  :  If  the  function  f :  IR"  -» IR1  is  continuously  differentiable  in  an  open 
convex  set  Qc!R",  then  for  any  x ,  x  +p  e  Q 

l 

f{x+p)-f{x)  =  jg(x+Xp)‘p  dX  .  (2.3) 

o 

Lemma  2.4  :  If  the  function  F  :  IR”  -» IRm  is  continuously  differentiable  in  an  open 
convex  set  £2  c  IR" ,  then  for  any  x,x  +p  e  Q, 

l 

F(x+p)-F(x)=fVF(x+\p)‘pdX  (2.4) 

o 

where  V  F  is  the  transpose  of  the  Jacobian  of  the  function  F . 

Lemma  2.5  :  If  the  function  F :  IR”  ->  IRm  is  continuously  differentiable  in  an  open 
convex  set  £ldRn,  then  for  any  x,x  +p  e  Q, 

l 

II  F(x  +p)-F  (x)  II  <  llpll  Jll  VF(x+Xp)ll  d\ 

0 


(2.5) 
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Lemma  2.6  :  If  the  function  f :  IR”  -»  IR1  is  twice  continuously  differentiable  in  an 
open  convex  set  Q  c  lRn ,  then  for  any  x,  x  +p  e  Q. 

l 

f(x)-f(x+p)  =  -g(x)tp-jptH(x+ty)p(l-£3)d?3  .  (2.6) 

o 

Furthermore,  if  xk,  xk  +pk  e  Q  and  if  predk  is  a  quadratic  model  in  standard 
form,  we  have 

l 

aredk(pk)-predk(pk)  =  J(1  -£)(p*)'  (Bk  -H(xk  +£pk))(pk)d%  .  (2.7) 

o 

Proof  :  From  Lemma  2.3  we  have  that 

l 

f(x)-f(x+p)  =-jg(x+Xp)lpdk  . 

o 

From  Lemma  2.4  we  have  that 

l 

g  (x  +  Xp )  =  g  (x ) + 1' V g  (x  +  Tip  y  Xpd x 

0 

X 

=  g(x)  +  jH(x  +  Zp)pd£,  . 

o 


Combining  the  two  gives 
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f(x)-f(x+p)  = 


-j  g(x)  +  jH(x+^p)p  d%  p  dX 

0  0 


-jg(x)‘pdX~jjp,H(x+lp)pd^dX 

o  oo 


-g(x)‘p-j  \p‘H(x+ty)p  dX  d % 


-g(x)‘p  — J(1  -Z,)p‘H(x+ty)p  d £, 
o 


which  proves  the  first  part  of  the  lemma.  We  then  have 


predk ( pk)  =  ~g(xk )‘pk  -  y  ( pk)‘  Bk  (pk ) 


=  ~g  (• h)‘Pk-(Pk Bk 

o 


-g(xk)‘pk-\(l-^)(pk)‘  Bk  (pk)d £, 
o 


so  that  setting  x  =xk  and  p=pk  and  subtracting  (2.9)  from  (2.8)  establishes  the 
second  part  of  the  lemma.  Q 


2.3.  The  Trust  Region  Subproblem 


Before  considering  the  full  trust  region  algorithm,  we  must  first  discuss  what  is 
known  as  the  trust  region  subproblem. 

Definition  2.13  :  Given  Ak  >0,  the  trust  region  subproblem  (denoted  TRS)  is 


maximize  predk(p )  . 

lip  II  <  A* 


(TRS) 


The  ball  { p  :  II  p  i!  <  A*  }  is  known  as  the  trust  region. 
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An  important  practical  point  in  trust  region  algorithms  is  the  inclusion  of  a 
positive  definite  scaling  matrix,  D ,  which  replaces  the  spherical  trust  region 
heretofore  assumed  with  the  hyperellipse 

II  Dp  II  <A*  .  (2.10) 

Typically,  D  is  a  diagonal  matrix.  If  D  is  constant,  then  its  inclusion  amounts  to  a 
simple  change  of  variables  w  =  Dp  that  converts  the  problem  defined  by  (2.10)  back 
to  the  standard  problem  TRS  with 

predk(w)  =  -gk  D~{w  -  yvv' D~^BkD~lw 

so  that  the  trust  region  theory  is  unaffected.  Other  trust  region  methods  use  a 
sequence  of  scaling  matrices  { Dk  }  instead  of  a  single  fixed  D .  Much  of  the  existing 
global  convergence  theory  can  be  extended  to  these  methods  provided  both  { Dk }  and 
{Dkl }  are  uniformly  bounded.  An  excellent  reference  to  scaling  matrices  is  More  and 
Sorensen  [1983].  For  simplicity,  we  will  take  Dk  =/  throughout  this  thesis. 

Computation  of  an  exact  solution  to  problem  (TRS)  is  not  practical  except  in 
certain  special  cases  (such  as  when  §k  is  convex  and  the  constraint  lip  II  <A*  is  not 
binding).  Consequently  much  of  the  literature  deals  with  methods  of  computing 
approximate  solutions  to  the  problem  TRS.  A  good  introduction  to  these  techniques 
is  Dennis  and  Schnabel  [1983].  Some  excellent  references  for  a  more  advanced 
treatment  are  More  [1982],  More  and  Sorensen  [1982,  1983],  Schultz,  Schnabel,  and 
Byrd  [1985]  and  Steihaug  [1981].  A  very  brief  review  of  some  of  these  methods 


follows. 
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2.3.1.  Optimal  Locally  Constrained  Steps 

Some  methods  approximate  a  solution  to  problem  TRS  by  computing  an  exact 
solution  to  a  “nearby”  problem  which  is  easier  to  solve.  In  this  context,  a 
convenient  “nearby”  problem  is  to  maximize  predk  subject  to  a  different  constraint 
(say,  lip  II  < A*  where  Ak~Ak).  We  refer  to  such  methods  as  optimal  locally 
constrained  (or  OLC)  methods.  The  reader  is  referred  to  More  and  Sorensen  [1983] 
for  the  history,  theory  and  practice  of  these  methods,  but  for  the  purposes  of  this 
discussion  we  only  need  to  point  out  a  few  properties. 

OLC  steps  (that  is,  the  approximate  solutions  to  problem  TRS  generated  by  an 
OLC  method)  do  not  necessarily  lie  in  the  trust  region,  but  are  allowed  to  lie  in  the 
region  {p :llp  II  <  (1  +CTj  )A* }  where  Oje(0,l).  NL2SOL  uses  ai  =  0.1,  but 
algorithms  which  use  values  such  as  ol=0.5  are  also  found  in  the  literature.  A 
minimal  requirement  for  an  OLC  step  is  that  the  predicted  reduction  for  this  step 
must  be  at  least  a  given  fraction  of  the  predicted  reduction  of  the  exact  solution  to 
problem  TRS.  That  is,  if  pk  is  the  solution  to  problem  TRS  and  pk  is  an 
approximation  generated  by  an  OLC  method,  there  exists  c2> 0  independent  of  k 
such  that 

predk  (pk )  >  c2predk  (pk  )  .  (2.1 2) 

More  and  Sorensen  [1983]  present  an  algorithm  that  will  generate  a  pk  satisfying 
(2.12)  in  a  finite  number  of  steps.  Furthermore,  computational  experience  indicates 
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that  on  the  average  less  than  two  steps  (each  involving  the  Cholesky1  factorization  of 
an  AiXrt  matrix)  are  required  to  generate  pk  satisfying  (2.12).  The  algorithm  does  not 
require  that  <j)*  be  convex,  and  in  fact  makes  use  of  directions  of  negative  curvature 
to  inhibit  convergence  to  saddle  points  and  to  choose  directions  of  rapid  decrease. 

2.3.2.  Restricted  Subspace  Methods 

Another  approach  to  approximating  pk  is  to  consider  only  the  values  of  p  in  a  given 
subspace  £2*  c  IR" . 

Definition  2.14  :  Given  a  scalar  A*  >0  and  a  subspace  £2*  c IR",  the  restricted 
subspace  trust  region  subproblem  is 

maximize  predk  ip ) 

p (2.13) 

S/t  lip  II  <A*  . 

A  restricted  subspace  method  will  compute  an  approximate  solution  to  (2.13), 
and  use  this  as  approximation  to  pk.  If  Q.k  is  of  small  dimension  when  compared  to 
IR",  then  such  a  method  can  subsequently  reduce  overhead.  Typically,  £2*  is  chosen 
to  be  the  two  dimensional  space  spanned  by  -gk  (the  direction  of  steepest  descent) 
and  ~(Bk)~lgk  (the  quasi-Newton  direction).  For  positive  definite  Bk,  these  two 
directions  are  natural  because  they  represent  pk  in  the  two  limiting  cases  A*  — >0  and 
A*  -»oo,  respectively.  Furthermore,  solving  (2.13)  over  a  two  dimensional  subspace 
is  equivalent  to  solving  a  fourth  order  polynomial  in  one  unknown,  which  is  a 


1  See,  for  example,  Golub  and  Van  Loan  [1983],  Section  5.2. 


24 


computationally  tractable  problem.  A  disadvantage  of  most  of  these  methods  is  that 
they  do  not  ensure  (2.12),  but  instead  lead  to  a  weaker  condition:  that  the  predicted 
reduction  is  at  least  a  fraction  of  the  predicted  reduction  at  the  solution  of  (2.13). 
When  gke  £lk,  solving  (2.13)  implies  that  there  exists  c [ >0  independent  of  k  such 
that 


where  pk  solves 


predk(pk)>  c \predk(pk) 


(2.14) 


maximize  predk  (p ) 

P=a 8*  (2.15) 

sit  1 1  p  1 1  <  A*  . 

We  refer  to  pk  as  the  Cauchy  step. 

A  much  more  complete  description  of  restricted  subspace  methods  can  be  found 
in  More  [1982], 


2.3.3.  Dogleg  Methods. 

A  third  approach  to  approximating  pk  can  be  motivated  in  several  ways. 

Definition  2.15  :  Let  Q.k  be  a  subspace  of  IRn  with  gke  Q.k.  Consider  a  curve  p{ a) 
in  £lk  with  a  defined  on  [0,1].  This  curve  is  said  to  be  a  generalized  dogleg 
if  it  satisfies  the  following  conditions. 

(a)  p(0)  =  0. 


(b)  gk  is  tangent  to  p  (a)  at  a  =  0. 
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(c)  lip  (a)  1 1  is  strictly  monotone  increasing  in  a. 

(d)  predkip{ a))  is  strictly  increasing  in  a  over  [0,1]. 

(e)  p(l)  is  a  global  maximizer  of  predk  over  Q.k  if  Bk  is  positive  semidefinite,  else 

II P  (1)11  -A*. 

Loosely  speaking,  a  generalized  dogleg  is  a  curve  in  IR” ,  starting  at  the  origin, 
proceeding  in  a  path  that  increases  predk ,  and  terminating  at  the  global  maximizer  of 
predk  (if  such  a  maximizer  exists).  Thus  the  problem 

maximize  predk  ip  (a)) 

«6{0,1]  (2.16) 

s/Hlp(a)ll  <  At 

has  two  possible  solutions  p(  a)  : 

a*  =  1  ,p(a*  )=pl  (2.17) 

or 

a* €(0,1)  ,  llp(a*)ll  =  A*  .  (2.18) 

The  first  occurs  when  p(l)  lies  within  the  trust  region.  For  positive  definite  Bk,  p(l) 

is  simply  the  quasi-Newton  step,  ~Bk^gk,  denoted  pk.  The  second  case  occurs  when 

no  global  maximizer  exists,  or  when  it  is  outside  the  trust  region.  Here,  p  (a  )  is 

simply  the  point  of  intersection  of  p  (a)  with  the  surface  of  the  trust  region. 

Expressed  in  this  fashion,  problem  (2.16),  is  clearly  meant  to  be  similar  to  the 
restricted  subspace  problem  (2.13),  but  with  the  potential  of  requiring  even  less 
computational  overhead.  The  original  dogleg  was  suggested  by  Powell  [1970  a,b]. 


who  used  the  curve 
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2apk  0C<'T 

p(a)=\  \  (2.19) 

2(1-  ct)Pk  +  (2oc  -  1  )Pk 

This  curve  consists  of  two  line  segments.  The  first  starts  at  p  =  0  and  proceeds  to  the 
Cauchy  step  pk  and  the  second  joins  the  Cauchy  step  with  the  quasi-Newton  step  p". 
The  only  matrix  factorization  needed  to  define  p  (a)  and  compute  p  (a* )  is  the 
original  factorization  used  to  compute  pk.  For  secant  methods  that  update  Bk  in 
factored  form,  the  overhead  drops  even  further  to  order  n1.  Further  savings  can  be 
realized  when  the  problem  TRS  must  be  resolved  with  a  different  A* .  OLC  methods 
must  resolve  the  problem  almost  from  scratch,  entailing  even  more  matrix 
factorizations.  Dogleg  algorithms  need  only  compute  the  new  point  of  intersection  of 
p  (a)  with  the  surface  of  the  new  trust  region. 

Dennis  and  Mei  [1979]  present  a  double  dogleg  algorithm  similar  to  Powell’s 
dogleg  method,  but  which  uses  three  line  segments  in  n( a).  Steihaug  [1981] 
develops  a  multi- segmented  dogleg  method,  based  on  the  conjugate  gradient  method. 
This  technique  is  natural  and  attractive  in  that  successive  conjugate  gradient  steps 
maximize  predk  over  a  sequence  of  successively  expanding  Krylov  subspaces. 
Dogleg  procedures  that  take  advantage  of  indefiniteness  are  presented  by  Schultz, 
Schnabel  and  Byrd  [1985]  and  Steihaug  [1981]. 

Although  dogleg  algorithms  have  the  advantage  of  reduced  overhead,  they  may 
generate  steps  inferior  to  those  produced  by  OLC  algorithms.  The  ideal  dogleg 
would  follow  the  curve  of  solutions  to  problem  TRS  as  A*  increases  from  zero.  This 
may  differ  considerably  from  the  path  defined  by  a  dogleg  or  double  dogleg. 
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However,  computational  experience  with  dogleg  algorithms  suggests  that  any  decrease 
in  performance  is  not  significant  enough  to  outweigh  the  dogleg’s  advantages.  At 
present  the  choice  between  dogleg  and  OLC  methods  seems  to  be  a  matter  of  taste. 

Dogleg  algorithms  generate  steps  that  satisfy  slightly  different  conditions  from 
OLC  algorithms.  Taking  the  Dennis-Mei  double  dogleg  (with  Bk  assumed  to  be 
positive  definite)  as  an  example,  the  generated  step  pk  satisfies  equations  (2.14)  and 
(2.15)  rather  than  the  stronger  (2.12).  Furthermore,  we  have  either 

Pk=~(Bk)~l8k  and  Up* H  <4  (2.20) 

or 

lip*  II  =  A*  .  (2.21) 

For  a  more  extensive  survey  of  dogleg  type  algorithms,  including  material  on 

preconditioning,  the  reader  is  again  referred  to  the  excellent  article  More  [1982]. 

2.4.  Trust  Region  Algorithms 

Different  authors  have  created  many  different  versions  of  trust  region  algorithms. 
Although  the  central  features  are  the  same,  some  of  the  details  differ  sufficiently  to 
cause  a  certain  amount  of  fragmentation  in  the  theory  as  presented  in  the  literature. 
Proofs  of  most  results  are  specific  to  a  given  implementation,  and  it  is  often  not 
immediately  clear  whether  a  slightly  different  implementation  is  equivalent.  Much  of 
this  is  the  result  of  sacrificing  generality  for  clarity;  including  the  possibility  of  many 
options  may  obscure  more  important  points  about  an  algorithm  or  proof.  In  this 
thesis  we  present  simplified  sample  algorithms  when  clarity  is  desired  but  otherwise 
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present  theory  for  algorithms  written  to  allow  a  large  number  of  different 
implementations. 


9 

We  now  present  three  representative  trust  region  formulations. 


Algorithm  A.2.1  :  A  Minimal  Version  of  a  Trust  Region  Algorithm 

0)  Let  x  te  IR"  and  A1  >  0  be  given. 

1)  Repeat  until  convergence  : 

2)  Compute  an  approximate  solution  pk  to  problem  TRS. 

aredk  (pk )  i 

3)  If - < - ,  then  set  At  =  At/4  and  return  to  2. 

predk(pk )  100 

4)  Set xk+\—xk+pk  ,  Aj.+i  —  Ak. 

aredk  (pk )  i 

5)  lf - ,  ,  -r  then  set  A*+1  =  2xAk . 

predk(pk)  2 

6)  End. 


This  algorithm  should  be  a  clear  demonstration  of  the  central  idea.  The  trust 
radius  Ak  is  adjusted  at  each  iteration  so  that  the  model  and  function  values  retain  at 
least  a  minimal  amount  of  agreement  over  the  trust  region.  The  measure  used  for 
judging  the  agreement  between  model  and  function  is  the  ratio  of  the  actual  reduction 


2  Trust  region  algorithms  generally  include  several  parameters  that  differ  from  implementation  to 
implementation.  The  sample  formulations  presented  in  this  chapter  do  not  include  the  full  range  of 
parameters  we  will  use  in  Chapters  4  and  5.  In  order  to  keep  our  notation  consistent,  the  subscripts 
used  with  parameters  in  this  chapter  are  defined  to  match  the  subscripts  needed  in  later  chapters.  The 
reader  is  referred  to  Appendix  A  for  typical  values  assigned  to  these  parameters. 
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to  predicted  reduction  for  the  last  step.  No  step  is  accepted  unless  this  ratio  achieves 
a  certain  minimum  value.  We  will  sometime  refer  to  a  step  as  “satisfactory”  or 
“unsatisfactory”  based  on  this  criterion. 

The  approximate  solution  to  problem  TRS  can  be  computed  by  the  methods  of 
the  last  section,  and  is  required  to  satisfy  (2.12)  or  (2.14)  as  well  as  the  matching 
conditions  on  the  norm  of  pk  with  respect  to  A*. 

A  well-known  result  (see,  for  example,  Powell  [1975])  is  that  either  (2.12)  or 
(2.14)  is  sufficient  to  imply  that  there  exists  c  >0  independent  of  k  for  which 

IlgJI 

predk  (pk)> c  II  gk  II  min{  Ak ,  jj—y  }  .  (2.22) 

This  in  turn  implies  that  the  sequence  of  iterates  produced  by  the  algorithm  is  FOSPC 
provided  {Bk  }  is  uniformly  bounded  and  /  satisfies  certain  mild  conditions. 

Now  that  we  have  demonstrated  the  basic  elements,  let  us  examine  the 
formulation  presented  by  Powell  [1984]. 


30 


Algorithm  A.2.2  :  Trust  Region  Algorithm  /  Powell  [1984] 

0)  Let  0<y0<y1<  l<y3,  0 <r)3 <  1,  xx  e  IR"  ,  Aj>0  be  given. 

1)  Repeat  until  convergence  : 

2)  Compute  pk  satisfying  (2.22)  and  either  (2.20)  or  (2.21). 

3)  If  aredk(pk)> 0,  then  set  xk+{ -xk  +pk  and  update  Bk, 

Else  set  xk+l  =xk. 

4)  If  aredk(pk)>r\2  predk(pk)  then  set  Ak+l  e  [  1 1  p*.  II  ,Y3llp*ll  ], 

Else  set  Ak+le  [y0Hp*  II  ,Yi WpkM  l 

5)  End. 


Powell  [1984]  shows  that  this  algorithm  produces  iterates  that  are  WFOSPC 
under  mild  conditions  on  /  provided  there  exist  constants  P2  an(i  P2  f°r  which 

II5JI<P2  +  P2^.  (2.23) 

This  algorithm  differs  from  A. 2.1  in  several  particulars.  A.2.2  updates  the  trust 
radius  based  on  llp^ll  rather  than  A*  as  in  A.2.1.  The  steps  computed  for  A.2.2 
must  satisfy  (2.20)  or  (2.21),  while  A.2.1  allows  the  more  relaxed  bound 
llpitll^(l+CT1)At. 

aredk  (pk ) 

More  importantly,  Powell  allows  the  acceptance  of  a  step  if  - - - >0 

predk(pk ) 


rather  than  demanding  that  the  ratio  be  greater  than  a  positive  constant. 
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Schultz,  Schnabel  and  Byrd  [1985]  give  an  abstract  version  of  a  trust  region 
algorithm  that  admits  most  of  the  trust  region  logic  in  use  today.  Their  algorithm, 
which  is  FOSPC  when  { Bk }  is  uniformly  bounded  and  /  satisfies  mild  conditions,  is 
defined  as  follows. 


Algorithm  A.2.3  :  Trust  Region  Algorithm  /  SSB  [1985] 

0)  Let  Y1,ri1,r|2e  (0,1),*! 6  IR”  ,Aq>0  be  given. 

1)  Repeat  until  convergence: 

2)  Find  Ak  and  compute  pk=pk(Ak)  satisfying  lip*  II  < A*  and 
aredk(pk) 


a) 


->1]!  and 


predk(pk) 
b)  Either  A*>A*_1?  or 

Ak  >  II  Bk}x  gk_x  II  with  Bk_x  positive  definite  or 

For  some  A<  —A*  , 

Yi 

aredk  (pk  (A))  aredk_x(pk_x{A )) 

predk(pk (A))  <  ^  °f  predk_x(pk_x{ A))  <  ^  ' 

3)  Let  xk+x  =xk  +pk  and  k  =  k+ 1  . 

4)  End. 


Rather  than  directly  specifying  a  technique  to  be  used  for  computing  steps  pk, 
Schultz,  Schnabel,  and  Byrd  present  general  conditions  that  steps  must  meet  to  be 
considered  acceptable.  These  conditions  therefore  define  the  allowable  range  of 
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options  in  the  unspecified  portions  of  the  algorithm. 

Much  of  our  multi-model  theory  is  a  generalization  of  the  trust  region  theory  in 
this  paper.  We  shall  abbreviate  all  references  to  this  paper  for  brevity,  calling  it  SSB 
[1985].  Furthermore,  the  general  algorithms  we  present  will  be  abstract  in  the  same 
sense  as  A.2.3;  parts  of  the  algorithm  will  be  left  undefined  except  for  our 
specification  of  certain  conditions  the  algorithm  must  obey. 


CHAPTER  3 


Motivation  and  Nomenclature  for  Multi-model  Algorithms 

3.1.  Introduction 

In  this  chapter  we  examine  some  classes  of  multi-model  algorithms.  Section  3.2 
examines  the  motivations  and  principles  we  use  in  designing  a  multi-model  algorithm. 
Section  3.3  introduces  nomenclature  for  classifying  multi-model  algorithms  into 
different  categories,  points  out  where  the  existing  algorithms  fall  in  such  a 
classification  scheme,  and  gives  a  preliminary  discussion  of  the  advantages  and 
disadvantages  of  algorithms  in  these  categories. 

3.2.  Motivation  and  Goals 

Before  examining  the  theory  or  implementation  of  a  multi-model  algorithm,  we 
should  closely  examine  what  we  hope  to  accomplish.  Multi-model  algorithms 
invariably  require  more  computational  overhead,  storage,  and  code  complexity.  Thus 
it  is  imperative  that  any  such  algorithm  be  examined  closely  to  see  that  it  does  at 
least  something  better  than  either  of  the  corresponding  single  model  algorithms. 
Several  possibilities  suggest  themselves  immediately.  A  multi-model  algorithm  may 
perform  better  in  practice,  it  may  have  better  theoretical  global  convergence 
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properties  than  a  corresponding  single  model  algorithm,  or  it  may  have  better 
theoretical  local  convergence  properties. 

Extensive  testing  of  diverse  multi-model  algorithms  for  NLS  problems  has 
shown  them  to  be  often  superior  to  single  model  algorithms,  even  for  somewhat  ad- 
hoc  incorporation  of  the  additional  models.  Given  this  practical  success,  we  deal 
strictly  with  theoretical  properties  in  this  thesis.  However,  this  theory  must  deal  with 
practical  algorithms,  and  thus  we  must  discuss  what  properties  a  “practical” 
algorithm  should  have. 

One  necessary  property  is  efficiency.  Any  procedure  devised  for  selecting  or 
switching  models  cannot  add  so  much  overhead  that  the  algorithm  is  too  expensive  to 
run.  The  “ideal”  solution  would  be  an  algorithm  that  selects  the  single  best  model  at 
each  iteration  without  any  extra  overhead  or  function  evaluations.  A  “worst  case” 
example  might  involve  computing  and  testing  trial  steps  for  each  possible  model  at 
each  stage  of  the  algorithm.  Such  an  algorithm  might  be  as  follows. 
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Algorithm  A.3.1  :  Sample  algorithm  with  excessive  overhead 
0)  Initialize. 

1)  Repeat  until  convergence  : 

2)  Repeat  until  all  models  have  been  tried. 

Select  a  model. 

Compute  a  trial  iterate  for  this  model. 

Compute  /  for  this  trial  iterate. 

3)  Choose  the  “best”  iterate  computed  above.  If  it  is  “unsatisfactory,” 

then  reduce  the  trust  radius  and  return  to  2). 

4)  Adjust  the  trust  radius. 

5)  End. 


Although  this  process  reduces  the  number  of  external  iterations  --  the  loop  between  1) 
and  5)  --  needed  to  achieve  a  given  tolerance,  it  is  clearly  too  inefficient  to  be  useful 
unless  the  function  evaluations  and  linear  algebra  involved  are  exceedingly 
inexpensive.  Even  if  only  two  models  are  used  and  the  trust  radius  is  never 
decreased,  this  algorithm  doubles  the  number  of  function  evaluations  needed  to 
perform  a  given  number  of  iterations.  For  more  than  two  models,  and  with  a  typical 
frequency  of  trust  radius  reductions,  both  the  number  of  function  evaluations  and  the 
amount  of  numerical  linear  algebra  needed  can  be  a  large  multiple  of  the 
corresponding  single  model  algorithm. 

Another  necessary  goal  is  reliability.  In  looking  at  theoretical  global  or  local 
properties,  the  best  of  all  possible  worlds  would  be  a  model  switching  algorithm  that 
always  behaves  at  least  as  well  as  would  be  expected  for  a  single  model  algorithm 
using  the  “best”  model.  With  this  in  mind,  we  often  examine  how  well  a  proposed 


multi-model  algorithm  can  handle  the  case  when  one  of  its  models  becomes 
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arbitrarily  bad,  and  we  are  most  satisfied  with  algorithms  that  do  not  have  their 
performance  degraded  by  such  models.  As  discussed  in  the  examples  of  Section  1.2, 
such  a  multi-model  algorithm  allows  one  to  include  in  production  code  models  which 
may  sometimes  perform  badly,  or  even  entirely  speculative  models,  without  fear  of 
disaster.  For  a  SSB  [1985]  type  single  model  trust  region  algorithm  such  as  A.2.3, 
FOSPC  can  be  obtained  under  mild  conditions  on  /  if  the  sequence  of  model 
Hessians  is  uniformly  bounded.  For  a  Powell  [1984]  type  trust  region  algorithm  such 
as  A.2.2,  the  condition  on  the  sequence  of  model  Hessians  is  relaxed  slightly,  but  the 
convergence  result  is  weaker  (WFOSPC  rather  than  FOSPC).  In  a  multi-model 
algorithm,  we  consider  it  quite  reasonable  to  require  that  one  model  obey  the  stronger 
condition.  Consequently,  we  put  our  algorithms  in  a  framework  similar  to  SSB 
[1985]  to  obtain  the  stronger  results.  References  to  “global  convergence”  should  be 
interpreted  as  FOSPC  unless  otherwise  stated.  An  algorithm  is  referred  to  as  “safe” 
if  we  can  specify  conditions  on  the  models  which  imply  that  the  algorithm  is  FOSPC. 

Many  of  the  obvious  ways  to  avoid  arbitrarily  bad  models  are  incompatible  with 
the  goal  of  “efficiency.”  Algorithm  A.3.1  above  certainly  satisfies  our  goal  of 
reliability  but  is  not  efficient  in  most  cases.  Another  reliable  algorithm  is  as  follows. 
Let  B  be  a  given  symmetric  matrix. 
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Algorithm  A.3.2  :  Sample  algorithm  which  is  reliable  but  inefficient 

0)  Initialize. 

1)  Repeat  until  convergence  : 

2)  Set  Bk  to  be  a  secant  update  of  Bk_^. 

3)  If  II  Bk  II  >11  5  II  set  Bk  to  B. 

4)  Compute  a  trial  iterate  for  this  model. 

5)  If  it  is  “unsatisfactory,”  then  reduce  the  trust  radius  and  return  to  4). 

6)  Adjust  the  trust  radius. 

7)  End. 


Now,  if  B  is  defined  so  that  it  properly  reflects  the  scale  of  the  problem  over  the 
entire  range  of  iterates,  this  algorithm  is  both  safe  and  efficient.  However,  if  B  does 
not  reflect  the  scale  of  the  problem,  it  can  force  the  algorithm  to  use  a  highly 
inappropriate  model  even  if  the  secant  model  is  performing  well.  As  an  extreme 
example,  consider  using  the  identity  matrix  for  B .  Then  whenever  the  norm  of  the 
secant  update  is  greater  than  one,  Bk  is  set  to  /  and  the  algorithm  reverts  to  the 
method  of  steepest  descent. 

In  analyzing  efficiency,  the  hierarchy  usually  assumed  is  that  evaluating  an  exact 
Hessian  is  much  more  expensive  than  evaluating  a  gradient,  which  is  much  more 
expensive  than  evaluating  a  function,  which  is  more  expensive  than  calculating  a  trial 
step  for  a  given  model,  which  is  more  expensive  than  evaluating  the  predicted 
reduction  for  a  trial  step.  A  major  feature  of  multi-model  algorithms  is  that  they  can 
be  tailored  to  specific  cases  within  this  hierarchy,  or  to  other  hierarchies  altogether. 
For  example,  although  Algorithm  A.3.1  is  quite  inefficient  under  the  standard 
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hierarchy,  it  becomes  reasonable  if  the  expense  of  evaluating  gk  is  extremely  large 
with  respect  to  the  expenses  associated  with  computing  steps  and  evaluating  / . 
Other  special  cases  may  also  make  A.3.1  attractive.  For  certain  classes  of  array 
processors  and  certain  classes  of  objective  functions,  /  can  be  simultaneously 
calculated  for  a  large  number  of  different  x  values  at  little  more  expense  than  a 
single  function  evaluation.  In  such  a  situation  A.3.1  would  be  competitive. 
Flexibility  in  algorithm  design  is  also  needed  to  take  advantage  of  tradeoffs  between 
“expensive”  and  “inexpensive”  models  as  in  Examples  1.2  and  1.3.  Algorithms 
specifically  designed  for  such  cases  are  important  enough  to  warrant  inclusion  in  this 
thesis.  Thus  it  is  imperative  that  our  global  convergence  theory  be  flexible  enought 
to  allow  such  possibilities. 

None  of  these  properties  or  goals  can  be  sacrificed  in  favor  of  the  others  if  our 
theory  is  to  be  realistic.  Chapters  4,  5,  and  6  present  algorithmic  frameworks  flexible 
enough  to  allow  a  wide  range  of  different  implementations  for  different  situations 
while  retaining  both  reliability  and  high  efficiency. 

3.3.  Nomenclature  for  Multi-model  Algorithms 

At  this  point  we  introduce  some  more  nomenclature.  Most  of  it  concerns  the 
switching  system  :  the  scheme  used  to  decide  which  models  to  use  and  how  to  use 
them.  We  define  three  different  types  of  switching  systems  as  follows. 

Definition  3.1  :  A  model  selection  algorithm  is  one  in  which  a  single  model  is 
selected  at  the  end  of  each  iteration  for  use  throughout  the  next  iteration. 
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This  approach  is  very  amenable  to  analysis  with  standard  global  convergence 
theory  since  only  one  model  is  being  used  at  each  iteration.  All  Al-Baali,  Fletcher 
[1985]  and  Nazareth  [1980,1983]  algorithms  are  of  this  type. 

Definition  3.2  :  A  model  switching  algorithm  is  one  in  which  two  or  more  candidate 
models  are  available  to  the  system  at  any  iteration.  The  algorithm  can  use  any 
or  all  of  these  models  to  compute  and/or  test  steps,  and  it  can  change  models 
during  an  iteration  based  on  these  computations. 

NL2SOL  (DGW  [1981])  is  an  algorithm  of  this  type,  and  all  model  selection 
algorithms  are  special  cases  of  model  switching  algorithms. 

Definition  3.3  :  A  concurrent  model  algorithm  is  one  in  which  more  than  one  model 
can  be  used  simultaneously  in  a  given  iteration. 

For  example,  a  possible  algorithm  mentioned  by  Nazareth  (but  not  tried) 
involved  computing  two  steps  --  each  based  on  a  different  model  —  and  choosing  the 
actual  step  to  be  some  convex  combination  of  the  two  steps. 

We  now  discuss  each  of  these  types  in  detail  and  preview  some  of  our  future 
results. 


3.3.1.  Model  Selection  Algorithms 

3.3.1. 1.  The  Algorithm 


Model  selection  algorithms  are  the  most  common  type  in  the  literature. 
Implementations  are  simple  and  reportedly  highly  successful.  Furthermore,  since  only 
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one  model  is  used  at  each  iteration,  global  results  can  be  stated  immediately  about 
many  of  these  algorithms  if  implemented  consistently  with  one  of  the  trust  region 
frameworks  of  Chapter  2,  and  if  the  sequence  of  model  Hessians  selected  is 
uniformly  bounded.  Consider  the  nonlinear  least  squares  problem  (NLS).  Fletcher, 
Al-Baali,  and  Nazareth  hybrid  algorithms  always  deal  with  the  Gauss-Newton  model 
J(xk)‘J(xk)  (which  we  assume  is  bounded  in  k),  and  a  model  produced  by  a  secant 
method.  If  there  exists  a  uniform  upper  bound  on  the  curvature  of  the  models 
produced  by  the  secant  method,  then  most  trust  region  algorithms  are  FOSPC  no 
matter  how  the  models  are  combined.  However,  the  existence  of  such  an  upper 
bound  is  an  open  question.  Furthermore,  if  we  assume  such  a  bound,  we  beg  the 
question  of  what  role  model  selection  actually  plays  in  helping  the  algorithm  achieve 
FOSPC.  The  really  interesting  question  is  whether  any  of  the  tests  in  the  literature 
can  distinguish  between  a  sequence  of  model  Hessians  which  is  uniformly  bounded 
and  a  sequence  which  is  unbounded.  This  question  is  answered  in  Section  3.3. 1.3. 

Algorithm  A.3.3  is  an  example  of  a  model  selection  algorithm  written  as  a  trust 
region  method.  Details  not  pertaining  directly  to  the  use  of  more  than  one  model  are 
suppressed  for  clarity. 
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Algorithm  A.3.3  :  Model  Selection 

0)  Let  x  i  e  R"  be  given. 

Define  an  initial  trial  trust  radius  Ak  >  0,  and  a  set  of  model  Hessians  {B\ }. 

Choose  one  of  the  models  to  be  the  current  preference. 

Let  f  be  the  index  of  this  model. 

1)  Repeat  until  convergence  : 

2)  Compute  a  trial  step  pi  that  approximately  solves  (TRS)  for  the  trial  trust  radius 
Ak  and  current  model  with  index  i  .  Then  either 

(a)  Accept  At  =  A*  and  p‘k ,  or 

(b)  Find  a  reduced  trust  radius  Ak  <  A*  and  compute  a  “satisfactory”  pi  that 
approximately  solves  (TRS)  for  the  currently  preferred  model. 


Set  pk=plk. 


3)  Choose  a  trial  radius  A*+1  for  the  next  step. 

4)  Perform  any  calculations  required  to  make  the  models  available  at  the  next  itera¬ 
tion,  and  select  the  model  Hessian  Bk+l  to  be  used. 


5)  Set  xk+l  =  xk  +  pk,fk+ 1  =  /*  ,  g*+i  =  g(xk+ 1),  k  =  k  +  1. 


6)  End. 


Some  of  the  model  selection  algorithms  in  the  literature  are  implemented  as  line 
search  methods,  but  we  analyze  them  as  if  they  were  implemented  using  trust  regions. 
Most  consider  only  two  models  at  each  iteration.  Algorithm  A.3.3  allows  for  any 
number  of  quadratic  models  to  be  generated  at  each  iteration,  but  only  one  is  actually 
selected.  The  choice  of  the  model  to  be  used  in  the  next  iteration  can  be  made  before 
or  after  the  models  are  updated  or  recalculated.  In  general,  steps  3)  -  5)  can  be  done 


in  any  order  desired. 
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Step  4)  is  phrased  to  allow  the  possibility  of  not  actually  forming  a  model  unless 
it  is  needed  in  the  next  iteration. 

3.3. 1.2.  Tests  for  Choosing  Among  Alternative  Models 

There  are  several  schemes  in  the  literature  for  choosing  a  model  for  the  next  iteration. 
We  first  consider  the  different  tests  used  to  evaluate  the  performance  of  a  model  and 
then  demonstrate  how  the  algorithms  put  these  tests  to  use.  Each  test  is  based  on 
computing  some  measure  of  how  “good”  a  model  is.  This  value  can  then  be  tested 
against  a  fixed  constant  to  classify  the  model  as  “acceptable”  or  “unacceptable,”  or 
two  models  can  be  compared  by  computing  the  “goodness”  measure  for  each  model 
and  comparing  the  values.  Al-Baali  and  Fletcher  [1985]  suggest  a  variational  test 
for  the  nonlinear  least  squares  problem.  This  test  calculates 

Vk{Bi)=  II  U-x-(Bj  r1  \\FtU  (3.1) 

for  a  positive  definite  Hessian  approximation  Bk,  where  U  is  a  BFGS  type  update  of 
B l  and  II  *  llF  £/  is  the  weighted  Frobenius  norm  defined  by 
II  A  llF{/  =  ll  U~mAU~m  II  f  .  Values  close  to  zero  indicate  a  “good”  model. 
The  matrix  U  could  actually  be  any  symmetric  positive  definite  matrix,  but  the  above 
definition  allows  this  test  to  be  interpreted  in  a  natural  fashion.  If  the  test  is  applied 
to  the  matrix  Bk  after  it  has  been  computed  at  the  end  of  iteration  k— 1,  it  can  be 
viewed  as  a  measure  of  “how  close”  the  matrix  B lk  is  to  satisfying  the  secant 
equation.  However,  to  be  meaningful  when  the  model  Bk  being  tested  is  itself  a 
structured  version  of  the  BFGS  secant  model,  the  test  must  be  done  before  the 
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update,  since  otherwise  the  value  of  Vk  for  this  model  is  automatically  zero.  Testing 
before  the  update  leads  to  the  interpretation  that  Vk  measures  “how  close”  the  matrix 
Bj_{  is  to  satisfying  the  secant  equation,  or  (for  the  BFGS  case)  “how  much  the 
model  will  change.”  Fletcher  and  Al-Baali  use  as  their  two  models  the  Gauss-Newton 
Hessian  and  a  structured  BFGS  secant  approximation.  Although  their  implementation 
tests  the  Gauss-Newton  model  after  it  is  generated  and  thus  mixes  the  interpretations 
above,  they  report  a  great  deal  of  success  in  practice.  Their  test  does  not  require  any 
extra  function  evaluations  and  can  be  implemented  without  excessive  amounts  of 
extra  numerical  linear  algebra.  It  should  thus  be  considered  as  a  viable  method  for 
selecting  a  model  for  use  in  the  next  iteration.  The  major  drawback  from  the  general 
point  of  view  is  that  there  exist  many  models  to  which  it  is  not  applicable.  Some  of 
the  structured  models  for  nonlinear  least  squares  do  not  necessarily  remain  positive 

definite,  so  the  test  cannot  be  applied.1  Nonquadratic  models  are  also  untestable  by 
this  measure.  Furthermore,  it  can  only  be  used  for  model  selection  and  not  for  model 
switching. 

Other  tests  used  by  various  authors  compare  the  observed  accuracy  of  different 
models  for  a  given  step.  Given  a  test  step  p ,  a  p  test  calculates  the  ratio 


pUp  )« 


aredk  (p ) 
predjip) 


(3.2) 


1  Although  this  test  can  be  applied  formally  to  positive  semidefinite  matrices  also,  the  meaning  of 
vk  is  ambiguous  in  this  case. 
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Values  close  to  one  denote  a  “good”  model.  This  test  seems  very  natural  given 
the  success  of  trust  region  methods  that  use  this  ratio  to  control  the  trust  radius.  It  is 
not  limited  to  positive  definite  quadratics  but  can  be  applied  to  any  general  model  for 
which  a  predicted  reduction  can  be  computed.  Furthermore,  it  can  be  used  both  for 
initial  model  selection  and  for  model  switching.  The  test  step  p  for  selection  of  the 
trial  model  is  typically  the  iterate  pk  just  computed.  For  model  switching,  p  is 
typically  the  most  recently  computed  trial  step.  A  function  evaluation  is  required 
whenever  this  test  is  used,  but  in  many  cases  this  function  evaluation  would  have 
been  required  anyway. 

Several  implementations  using  this  test  are  investigated  by  Al-Baali,  Fletcher 
[1985].  Their  algorithms  use  the  Gauss-Newton  model  until  p  for  this  model 
becomes  unacceptable  (p  <  0.05)  and  then  switch  to  a  BFGS  model.  Nazareth 
[1980,1983]  uses  this  test  to  control  an  algorithm  that  superficially  looks  different 
from  A. 3.3.  Instead  of  choosing  between  a  finite  set  of  models,  Nazareth  uses  a 
model  Hessian  which  is  a  convex  combination  of  two  other  model  Hessians,  say 

Bh  =  a*  +  (  1  -  a*  )  Bk2  (3.3) 

for  some  cq.  e  [  0  ,  1  ].  For  the  nonlinear  least  squares  problem,  if  one  model  is  the 
Gauss-Newton  model  and  the  other  is  a  structured  secant  method,  this  is  equivalent  to 
choosing  a  scalar  multiple  of  our  secant  approximation  to  the  second  order  term 
S(xk)  (see  Example  1.2).  Although  the  algorithm  draws  from  an  infinite  set  of 
possible  models  rather  than  a  finite  set,  only  one  model  is  chosen  at  each  iteration  so 
that  the  algorithm  is  no  different  in  principle  from  other  model  selection  algorithms. 
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The  balance  between  the  two  original  models  is  initially  set  to  0.5  and  is  adjusted  by 
the  following  procedure. 


Procedure  A.3.4  :  Nazareth’s  Method  for  Choosing  ak . 

a)  Compute  pl-i^Pk-O  and  P*-i(P*-i)* 

b)  If  pjfe_i(p*_i)  is  closer  than  Qk-\(Pk-i) t0  U 

then  bias  toward  model  one  : 

If  cx)t_1  =  0,  set  ak  =  .05. 

If  a*_je  (0,y],  set  ak  =  2ak_{. 

If  ak_l<=  (y,.95),  set  ak  =  (  l+ak_x)/2. 

If  a^_!  e  [.95,1],  set  ak  =  1. 

Else  bias  toward  model  two  : 

If  a*..!  =  1,  set  ak  =  .95. 

2 

If  awe  [— ,1),  set  =2a*_1-l. 

2 

If  ak_x  €  (.05, y),  set  ak  =ak_x!2. 

If  ak_x  e  [0,.05],  set  ak  =0. 

c)  Proceed 


Note  that  rather  than  using  the  p  test  on  the  model  actually  used,  this  method  applies 
it  to  the  two  “base”  models  Bk  and  Bk  and  then  generates  the  model  Bk  to  be  used. 
More  complicated  procedures  are  obviously  possible,  but  since  only  one  model  is 
finally  chosen  at  each  iteration,  any  such  technique  would  still  be  a  model  selection 


algorithm. 
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A  third  useful  test  is  similar  to  the  p  test  in  that  it  compares  predicted  reduction 
to  actual  reduction  for  specific  test  steps.  As  in  the  p  test,  p  is  typically  pk_x  or  pk. 
Given  a  test  step  p ,  an  e  -  test  computes  the  observed  error  : 

ek(p  )  =  I  aredk(p)  -  predk(p)  I  .  (3.4) 

Values  close  to  zero  denote  a  “good”  model.  This  test  is  used  in  NL2SOL  and 

is  the  preferred  test  in  the  model  switching  algorithms  of  Chapter  5.  Its  primary 
difference  from  the  p  test  is  that  it  lacks  a  “ pred  ”  term  in  the  denominator.  Leaving 
out  this  term  eliminates  certain  sensitivity  problems  associated  with  small  predicted 
reductions. 

3.3. 1.3.  Some  Properties  of  these  Tests 

As  previously  stated,  the  really  interesting  question  about  these  tests  is  whether  any 
can  distinguish  between  a  sequence  of  model  Hessians  which  is  uniformly  bounded 
and  a  sequence  which  is  unbounded.  We  now  demonstrate  that  none  of  the  methods 
of  initially  choosing  a  model  discussed  in  the  literature  are  capable  of  always 
rejecting  an  arbitrarily  bad  model. 

Proposition  3.1  :  There  exist  two  sequences  of  model  Hessians,  denoted  {Bk  }  and 
{Bk  \  ,  with  1 1 5/ 1 1  <  (3  3 /or  some  j33  >  0,  and  with  {llfi/ll  }  unbounded  for 
which  the  variational  test  can  select  a  model  from  the  second  group  at  every 
iterate. 

Proof  :  We  prove  this  proposition  by  construction.  Al-Baali,  Fletcher  [1985] 
give  the  following  representation  for  Vk(B)  : 
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Vk(B)  = 


y‘kB  lyk 


yicPk 


ykPk 

-  2  -  +  1 


PkBPk 


(3.5) 


where  the  weight  matrix  U  satisfies  the  secant  equation 


(3.6) 


for  some  nonzero  yk  e  IR".2  Suppose  that  n  >3,  and  define  a  set  of  vectors  {wk  }  in 
IR"  satisfying 


and 


w‘kpk  =0 
w‘kyk=0 


(3.7) 


II  w*  11=1.  (3.8) 

Consider  two  sequences  of  model  Hessians  defined  by 

Bk  =  I  (3.9) 

and 

Bk=I+k\wkwk  .  (3.10) 

Now,  by  the  Sherman-Morrison-Woodbury3  formula,  (fi^2)-1  exists  and  has  value 


{Bk  )_1  =  /  - 


k\ 

1  +k\ 


wkw‘k  . 


(3.11) 


2  The  standard  definition  for  yk  is  gk+\-gk,  so  that  U  interpolates  the  observed  change  in  g ,  but 
other  variations  are  typically  used  for  NLS  to  take  advantage  of  problem  structure. 

3  See,  for  example,  Dennis  and  Schnabel  [1983],  Lemma  8.3.1. 


Then 
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Moreover, 


y‘k(Bk)  V*  =y‘kIyk  - 


k\ 


\+k\ 


wbk 


—  yl  (Bk )  \ 


PkBkPk  =  Pk!Pk  +  k \(p‘kwk)2 
=  PkBkPk  ■ 

Therefore  combining  (3.5),  (3.12),  and  (3.13)  gives 


(3.12) 


(3.13) 


Vk(Bkl)  =  VkWk)  •  (3-14) 

Thus  the  variational  test  cannot  distinguish  between  the  models,  which  allows  model 

two  to  be  selected.  Q 

Proposition  3.2  :  There  exist  two  sequences  of  model  Hessians,  denoted  {Bk )  and 
{Bk),  with  Ili?*1 1 1  <  (33  for  some  p3  >  0  and  with  I H  Bk2  II  }  unbounded  for 
which  both  the  p  test  and  the  e  -test  can  generate  a  sequence  of  models  with 


lim  sup  11  Bk  II  =  <>«  . 


Proof  :  We  prove  this  proposition  by  construction.  Let /( x)  =  -^-x‘x.  Assume 

first  that  the  model  to  be  used  in  step  k  + 1  is  chosen  after  the  models  have  been 
updated.  Consider  two  sequences  of  model  Hessians  defined  by 

Bk  =  2/  (3.15) 

and 

Bk  =  k\ I  -  (fc! - 1 )  *  .  (3.16) 

Pk-iPk-i 


Then 


(3.17) 


pred^iPk- 1)  =  -8kPk-i- \Pk-i  W>Pk-\ 

=  aredk(pk_x )-y  Wpk-i\\} 


but 


predk(pk_x)  =  - \Pk-\ 


k\I  -  (it!- 1) 


Pk-iPk-i 

Pk-iPk-i 


Pk- 1 


(3.18) 


=  aredk  (pk-i). 


Therefore  both  the  p  and  e -tests  would  choose  model  Hessian  Bk  =Bk  over  Bk  even 
though  Bk  is  arbitrarily  bad  in  every  subspace  perpendicular  to  pk_v 

If  the  choice  is  made  before  the  models  have  been  updated,  then  consider  the 
two  sequences  of  model  Hessians 


Then 


Bkl=2l 

Bk2  =  (\  +  (-l)k)k\I+I 


(3.19) 


Predk  (Pk )  = 


aredk  (pk ) 

aredk ( pk)  +  k\  WpkW 


2 

2 


k  odd 
k  even 


(3.20) 


As  before,  both  the  p  test  and  the  e-test  would  select  model  two  at  the  end  of  each 
odd  numbered  iteration,  and  thus  the  algorithm  would  use  Bk  =Bk=2k !/  +/ 
throughout  each  even  numbered  iteration.  Q 

Proposition  3.3  :  There  exist  two  sequences  of  model  Hessians,  denoted  {Bk}  and 
{Bk},  with  I  Ifi*1 1 1  <  p3  for  some  p3  >  0,  and  with  {llfi/ll  }  unbounded  for 
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which  the  p  test  combined  with  interpolation  between  models  as  used  in 
Nazareth  can  generate  a  sequence  of  models  \  Bk }  as  in  equation  (3.3)  with 


lim  sup  1 1  Bk  1 1  =  °o  . 

k— >°o 


(3.21) 


Proof  :  We  prove  this  proposition  by  construction.  Let 


fO c)  =  jx‘x  . 

Consider  two  sequences  of  model  Hessians  defined  by 


(3.22) 


Bkl  =  /(  2-K-l)*) 


and 


(3.23) 


Bk1=I  0  +  (  l-(-l)*  )*!) 
Now,  at  any  iteration  where  k  is  odd,  we  have 


(3.24) 


P*(P)=1  + 


I p  II 


1 


2  -g‘kP-p‘p 


and 


(3.25) 


(3.26) 


P  *(p)=l  > 

so  that  oq.+1  is  biased  toward  the  second  model.  At  any  iteration  where  k  is  even,  we 
have 


Pk(p  )  =  1  +  ( £  !  -  -T-  ) 


pi(p)=l 


1 


-g‘kP  -k'-  Up  II  2 


(3.27) 

(3.28) 


so  that  ak+{  is  biased  toward  the  first  model.  Thus,  since  is  initialized  to  y,  we 

1  3 

have  that  a  alternates  between  —  and  — ,  and 

2  4 
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—  /  k  odd 

Bk  =  4  .  <3-29) 

( k !  +  — )  /  k  even 

L  2 

Thus, 

lim  sup  1 1  Bk  II  =  oo  . 

□ 

After  considering  these  examples,  it  seems  doubtful  that  any  model  selection 
algorithm  that  selects  a  model  solely  on  its  performance  at  the  last  step  can  have  the 
properties  we  want.  Since  we  are  not  allowed  to  switch  models  after  the  start  of  an 
iteration,  the  scheme  used  to  initially  choose  this  model  must  select  the  correct  one. 
We  have  seen  how  easy  it  is  to  produce  pathological  special  cases  to  fool  the  existing 

tests.4  However,  model  switching  algorithms  can  be  constructed  which  do  not  share 
these  (potential)  shortcomings.  With  this  in  mind,  model  switching  algorithms  are  to 
be  recommended  over  model  selection  algorithms. 


3.3.2.  Model  Switching  Algorithms 


Chapters  4  to  7  deal  with  this  type  of  algorithm  in  detail.  Since  switching  models  is 
allowed  after  the  current  iteration  has  been  started,  we  would  expect  that  the  harsh 


4  It  could  be  argued,  with  some  justification,  that  since  the  counterexamples  in  Propositions  3.1, 
3.2,  and  3.3  are  for  a  set  of  “arbitrarily  bad”  models  rather  than  a  set  of  “arbitrarily  bad  models  gen¬ 
erated  by  some  method  of  interest,”  model  selection  logic  might  still  link  with  the  model  generation 
scheme  in  a  manner  guaranteeing  global  convergence  anyway.  A  counterexample  to  this  possibility 
cannot  be  given  for  the  simple  reason  that,  for  methods  of  interest  such  as  the  BFGS,  boundedness  of 
the  Hessian  approximations  is  still  an  open  question.  In  lieu  of  more  knowledge,  we  must  assume  that 
if  there  is  the  possibility  of  one  of  the  secant  methods  generating  a  sequence  Hessians  of  unbounded 
norm,  then  there  is  the  possibility  that  these  sequences  may  have  the  same  structure  as  our  counterex¬ 
amples. 
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requirements  on  correctness  of  the  initial  trial  model  can  be  relaxed.  This  does  prove 
to  be  the  case.  We  show  that  when  properly  constructed,  the  model  switching  portion 
of  these  algorithms  does  play  a  nontrivial  role  in  assuring  global  convergence.  A 
wide  range  of  methods  for  model  switching  and  trust  region  modification  are  allowed. 
Furthermore,  three  major  versions  of  the  basic  algorithm  are  presented  which  explore 
the  tradeoffs  between  restrictions  on  the  models  and  restrictions  on  the  details  of  the 
algorithm.  Type  one  algorithms  require  all  models  to  be  quadratic  with  uniformly 
bounded  curvature  but  have  essentially  no  restrictions  on  the  details  of  the  switching 
system.  Type  two  algorithms  require  at  least  one  model  to  be  quadratic  with 
uniformly  bounded  curvature.  Each  other  model  is  required  to  generate  trial  steps 
satisfying  a  “uniform  predicted  decrease”  condition.  The  switching  system  is 
slightly  more  restricted,  in  that  it  must  limit  the  circumstances  under  which  the  trust 
radius  can  be  reduced.  Several  schemes  are  defined  which  do  this.  Type  three 
algorithms  only  require  one  model  to  be  quadratic  with  uniformly  bounded  curvature. 

3.3.3.  Concurrent  Model  Algorithms 

This  category  of  algorithms  is  a  “catch  all”  for  methods  that  are  more  complicated 
than  selection  or  switching  procedures.  At  present,  the  extra  complications  involved 
in  most  of  these  algorithms  appear  unjustified,  and  too  many  disparate  formulations 
exist  to  allow  a  general  framework.  However,  the  following  examples  are  included 
for  completeness.  The  first  one  is  trivial  but  the  second  is  quite  intriguing. 
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Many  --  or  even  most  --  implementations  of  optimization  algorithms  contain 
logic  which  is  implicitly  or  explicitly  equivalent  to  using  a  concurrent  scheme. 
Example  3.1  :  Any  algorithm  which  makes  some  decision  based  on  whether  the 
norm  of  some  model  Hessian  B  is  greater  than  a  constant  is  equivalent  to  an 
algorithm  making  the  same  decision  based  on  whether  the  norm  of  B  is  greater  than 
the  norm  of  B  where  B  =  p,  /.  Even  the  Armijo-Goldstein  condition  used  for  line 
searches  can  be  considered  a  concurrent  model  algorithm,  where  the  “alternate” 
model  Hessian  (used  only  for  testing  trial  steps  and  not  computing  them)  is  B  =  0. 

Concurrent  model  algorithms  can  also  be  fundamentally  different  from  anything 
previously  considered. 

Example  3.2  :  Rather  than  choosing  one  model  or  the  other  to  compute  a  trial  step, 
this  step  can  be  computed  by  reconciling  the  models.  The  first  model  can  be  used  to 
define  a  region  ft1  =  {  p  :p  satisfies  (2.22)  for  Bk  =  Bkl  }.  A  step  pk  can  then  be 
computed  to  maximize  the  predicted  reduction  of  the  second  model  over  the 
intersection  of  Q1  and  the  trust  region.  To  further  complicate  the  algorithm,  one 
could  also  allow  the  option  of  computing  a  step  to  maximize  the  predicted  reduction 
of  the  first  model  over  the  intersection  of  £22  and  the  trust  region  (where  Q2  is 
defined  in  the  obvious  fashion).  Although  this  is  a  fascinating  idea,  global 
convergence  can  be  guaranteed  by  the  simpler  model  switching  algorithms. 
Furthermore,  an  example  in  Chapter  7  casts  doubt  on  the  local  convergence  properties 
of  this  algorithm,  and  the  existing  techniques  for  solving  the  modified  subproblem  are 
not  as  numerically  efficient  as  the  more  mature  methods  in  existence  for  solving 
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problem  TRS.  Although  some  versions  of  multi-model  algorithms  of  this  type  may 
yet  prove  workable,  our  current  belief  is  that  none  will  prove  competitive  with  the 
other  algorithms  presented  for  unconstrained  optimization.  More  about  using  the 
modified  subproblem  above  in  constrained  optimization  can  be  found  in  Celis, 
Dennis,  and  Tapia  [1984]  .  This  approach  is  also  similar  to  ideas  from  multi¬ 
objective  optimization  (see,  for  example.  Woods  [1985],  Chankong  and  Haimes 
[1983],  or  Ignizio  [1982]). 

3.4.  Further  Useful  Concepts 

Additional  concepts  —  which  are  useful  in  actually  writing  an  implementation  or 
helpful  in  attaining  a  complete  understanding  of  multi-model  methods,  but  which  are 
of  small  importance  to  the  theory  presented  in  later  section  —  are  presented  in 
Appendix  B. 


CHAPTER  4 


Safe  Algorithms  using  Maximal  Restrictions  on  the  Models 

4.1.  Introduction 

This  chapter  analyses  the  global  convergence  properties  of  model  switching 
algorithms  when  all  of  the  models  are  quadratic  and  when  we  are  free  to  assume  as 
many  restrictions  on  the  models  involved  as  we  need.  We  show  that  if  there  exists  a 
uniform  upper  bound  on  the  norm  of  each  model  Hessian  available  to  the  system, 
then  essentially  any  switching  system  gives  first  order  stationary  point  convergence. 
This  generalizes  the  first  order  results  of  SSB  [1985]  to  the  multiple  model  case. 

4.2.  Preliminary  Definitions  and  Results 

The  notation  in  this  chapter  is  largely  an  extension  of  that  described  in  Chapter  2  for 
single  model  trust  region  algorithms.  Superscripts  are  included  to  distinguish  among 
models.  For  definiteness,  we  now  present  (or  review)  all  of  the  notation  necessary 
for  this  chapter.  The  reader  is  again  encouraged  to  make  use  of  the  glossary  in 
Appendix  A. 
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4.2.1.  Local  Models 


Assume  that  at  any  iteration  we  have  a  finite  set  of  symmetric  matrices  {Bk }  , 
i=  representing  approximations  to  the  Hessian  of  /  at  xk.  This  gives  us  Nk 

quadratic  models  in  standard  form 

pred‘k(p )  =  -g‘kp  -  j  p‘  Bkp  (4.1) 

for  use  in  approximating  the  function  in  local  coordinates. 

For  quadratics,  most  algorithms  guarantee  that  any  step  pk  computed  as  an 
approximate  solution  to  problem  TRS  satisfies  one  of  the  following  two  conditions. 
Let  Cj  >0  and  c2>0  be  two  constants  which  are  independent  of  k .  Steps  that  satisfy 


predlipl)  >  c  j  max  predlk(agk) 

a 

sit  II  a gk  II  <  A* 


are  called  FCD  (Fraction  of  Cauchy  Decrease)  steps.  Steps  of  this  type  are  generated 
by  dogleg  and  double  dogleg  procedures,  generalized  doglegs,  and  restricted  subspace 


methods  as  defined  in  Chapter  2.  Steps  that  satisfy 


predlk(pk)  >  c2  max  predk(p) 

P  (4.3) 

s  ll  II  p  II  <Ak 

are  generated  by  OLC  (Optimal  Locally  Constrained)  procedures.  Such  steps  can  be 

computed,  for  example,  as  in  More  and  Sorensen  [1983]. 1  The  matrix  Bk  need  not  be 

nonsingular  or  even  positive  definite.  An  OLC  step  also  satisfies  (4.2). 

1  If  gk  is  zero  at  any  iteration,  some  of  the  published  techniques  may  fail  to  guarantee  (4.3). 
However,  as  we  are  only  concerned  with  FOSPC,  we  consider  gk  =  0  at  any  iteration  to  constitute  suc¬ 
cessful  completion  of  the  algorithm. 
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The  following  lemma  is  well  known. 

Lemma  4.1  :  If  p j  is  an  FCD  step  with  constant  Cj  >0  or  an  OLC  step  with  constant 
c2> 0  for  a  quadratic  model  in  standard  form  with  model  Hessian  Bf  then 
there  exists  c3  e  (0,1)  independent  of  k  such  that 


predlk  ( plk )  >  c3  ll&ll  min \ 


H  & 


. 


bl  II 


(4.4) 


More  specifically,  every  c3<  —  cx  satisfies  (4.4)  for  an  FCD  step,  and  every 
c3  <  y  c2  satisfies  (4.4)  for  an  OLC  step. 


Proof  of  this  lemma  can  be  found,  for  example,  in  Powell  [1975].  This  is  also  a 
special  case  of  Theorem  6.4. 

If  1 1  Bl  1 1  is  bounded  as  k  — » <*>,  the  sequence  of  models  is  said  to  have 
uniformly  bounded  curvature.  By  Lemma  4.1,  if  each  p[,  is  an  FCD  or  OLC  step 
and  II  Bj.  II  <(32,  then 


(4.5) 


For  a  given  positive  definite  model  Hessian  B‘k  and  a  sufficiently  large  trust 
region,  most  implementations  generate  a  step  to  the  global  minimizer  of  the  model.  A 
quasi-Newton  step  is  a  step  p)  that  satisfies 


BkPk  =  ~  8k 


(4.6a) 
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For  large  sparse  systems,  implementations  often  do  not  require  exact  computation  of 
this  step  but  instead  impose  the  following  condition  on  how  accurately  the  next 
iterate  should  be  calculated.  A  step  p}  is  said  to  satisfy  the  inexact  quasi-Newton 
condition  if 

11  BicPi  +8k  11  ^  C*  WgkW  (4-6b) 

where  0  <  C,k  <  1  and  lim  C,k  =  0.  More  about  this  condition  can  be  found  in 

k— 

Steihaug  [1980].  Any  quasi-Newton  step  automatically  satisfies  the  inexact  quasi- 
Newton  condition. 

Most  versions  of  trust  region  algorithms  enforce  one  of  the  following  sets  of 
conditions  relating  the  size  of  a  computed  step  to  A* .  When  a  restricted  subspace  or 
a  dogleg  type  scheme  is  used  to  approximately  solve  the  trust  region  subproblem,  a 
step  satisfies  either 

1 1  pi  1 1  =Ak  (4.7a) 

or 

(  (4.6a)  or  (4.6b)  )  and  1 1  p*  1 1  <  A*  .  (4.7b) 

That  is,  each  step  is  either  on  the  boundary  of  the  trust  region,  or  it  is  a  quasi- 

Newton  (or  inexact  quasi-Newton)  step.  An  OLC  step  satisfies  either 

(  1  —  CTj  )  Ak  <  1 1  pi  1 1  <  (  1  +  (Jj  )  Ak  (4.8a) 

or 

(  (4.6a)  or  (4.6b)  )  and  I  Ip*  1 1  <  (  1  -  CTj  )  A*  .  (4.8b) 

In  either  event,  and  for  all  other  reasonable  implementations,  a  trust  region 


algorithm  enforces 
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lip/ II  <  (  1  +  CTj  )  A*  (4.9) 

for  some  constant  e  [  0  ,  1). 

Each  of  our  models  and  associated  procedures  for  computing  steps  may  differ 
considerably  in  the  conditions  they  guarantee.  In  order  to  avoid  overspecifying  our 
algorithm  at  this  point  we  introduce  the  following  notation. 

Definition  4.1  :  We  use  the  notational  shorthand  pkeA(i,Ak)  to  denote  any 
approximation  to  the  solution  of  the  trust  region  subproblem  that  is  acceptable 
with  respect  to  the  model  with  index  i  and  the  trust  radius  Ak .  If  the  index  of 
the  model  is  clear  from  context,  we  may  also  use  the  notation  pk  e  A(A*). 

The  precise  meaning  of  “acceptable”  will  be  specified  separately  for  each  class  of 
algorithms. 

4.2.2.  The  Standard  Assumptions  on  the  Function 

The  results  and  definitions  of  the  previous  section  applied  strictly  to  the  models  being 
used.  Other  than  the  occasional  use  of  gk ,  no  reference  has  been  made  to  the  actual 
function  being  approximated.  In  this  section  we  present  notation  related  to  the 
function  itself.  The  expression 

aredk(p)  =  f  (xk)  -  f  (xk  +  p)  (4.10) 

again  denotes  the  reduction  in  /  caused  by  a  step  p.  For  the  i—  model,  the  ratio  of 

actual  reduction  to  predicted  reduction  is  defined  as 
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P  k(P)  = 


aredk  (p ) 


(4.11) 


predk(p) 

Next,  recall  that  L(f  pc  denotes  of  the  level  set  of  /  at  x 
Definition  4.2  :  For  given  jcje  1R"  and  /  :IR”  — ^IR1,  the  following  conditions  are 
referred  to  as  the  standard  assumptions  on  the  function. 

(SA1)  /  is  twice  continuously  differentiable  on  an  open  convex  set  containing 

L(f  pc  x)  .  This  open  set  is  denoted  L(f  pc  j). 


(SA2)  /  is  bounded  below  on  L(f  pc  {)  . 

(SA3)  3  Pi  >0  such  that  for  all  xeLifpc j),  the  Hessian  of  /  satisfies 

\\H(x)\\  <p!  . 


4.2.3.  Some  Useful  Results 

The  next  lemma  is  invaluable  in  establishing  the  existence  of  acceptable  steps  in 
specific  stages  of  an  algorithm.  Let  01,03  be  defined  so  that  Gje  [0,1),  £36  (0,1), 
and  consider  the  following  conditions. 

(a)  II  p  II  <(  l+aj)A*  . 

IlgJI 

(b)  predi(p)>c3\\  gk\\  min  {A*,— —  }  . 

II  B  £  II 

Lemma  4.2  :  For  a  given  x }  e  1R" ,  let  f :  IR"  — >  IR1  satisfy  the  standard  assumptions 


and  suppose  that  for  some  k ,  the  vector  xk  satisfies  xk  e  L(f  pcf)  and 
g(xk  )*0  . 
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Then  there  exists  8*°>0  such  that  Ake  (0,8*]  and  condition  (a)  imply 


xk  +p  &L(f  pcx)  . 


(4.12) 


Furthermore,  let  Bk  be  the  model  Hessian  of  a  quadratic  model  in  standard 
form.  For  any  r|  e  (0,1) ,  let 


5/  = 


2(l-ri)c3  II  g*  II 
(l+ai)2( pi+  II  Bk  II  ) 


(4.13) 


If  Ak  e  (0 ,8/  ]  and  xk+p  e  L(f  x{)  ,  then  (a)  and  (b)  imply 


2-t)  >p|(p)>ri  . 


(4.14) 


Proof  :  From  the  hypotheses  and  the  definition  of  (SA1),  xk  is  in  the  open  set 
L{f  pcf),  which  immediately  implies  the  existence  of  a  positive  5k  for  which  Ak  <hk 
and  (a)  imply  (4.12). 


Next,  from  Lemma  2.6  and  assumptions  (SA1),  (SA3),  (4.12),  (a),  and  (b)  we  have 
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1-P*(P)I  = 


aredk  (p )  -  predlk  (p ) 


predlk  ip ) 


j(p)‘(Bl-H(xk  +Zp))(p){l-Z>)dZ> 


c 3  Hg*  II  rnin «  A*  , 


H  foil 


BL 


1 


p  I!  2(  II  II  +P( ) 


2 c3  II  g*  II  min  <  A*  , 


"ft"  1 


BL  II 


From  (a)  we  then  have 


I  1-P*(P)I  £ 


(llBill+PiKl  +  Oi)2^2 


2c3  llg*ll  min <  A*  , 


1 


Bl  II 


Since  c3<  1  and  Ak<dk,  (4.13)  implies 


< 


Hft 'I 

II  BL  II 


From  (4.13),  (4.16)  and  (4.17)  we  obtain 


(II  BL  II  +(31)(1+q1)2(A,)2 
2c3  llg*ll  A* 

(II  5‘  ll+pod+ao2  . 

2c3  11**11  * 


1-P l(p)  I  ^ 


<  (1-Tl) 


(4.15) 


(4.16) 


(4.17) 


(4.18) 


which  establishes  (4.14).  Q 
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Consider  a  sequence  {jc*  }  e  JR” .  For  every  m  >  0,  define  Q.m  to  be  the  set 


Qm  =  {  x\Wx-xm  II  < 


IlsJi 

2P, 


and  x  e  L  (f  jc  t)  }  . 


(4.19) 


Lemma  4.3  :  Iff  satisfies  the  standard  assumptions,  then  for  any  x  e  Qm , 


j  II  gm  II  ^UsOOH  ^  j  Ugm 


(4.20) 


Proof  :  If  xm4.L(f  ocx),  then  Qw  is  empty,  so  that  (4.20)  is  immediately 
established.  Consider  any  m  with  xm  e  L(f  For  any  x  e  L(f  ^Xj),  the  standard 
assumptions  and  Lemma  2.5  imply  II  g  (x)  -  gm  II  <  II  x  -xm  II  .  By  the 
triangle  inequality, 

llgmll+llg(x)-gmll  >llg(x)ll  >  HgMll-llg(x)  ~gmW  (4.21) 

soifllx-xmll  <11  gm  II  / 2j3t  ,  then 

yll  gm  II  >llg(x)ll  >  y  H&JI  (4.22) 

which  establishes  the  result.  Q 

Next,  consider  a  sequence  of  positive  numbers  {A*},  and  let  pk  be  defined  by 
pk  =xk+l-xk.  Furthermore,  consider  the  following  conditions. 


(i)  There  exists  ti1,c3€  (0,1)  and  P2  >  0  such  that  each  pk  satisfies  the  uniform 
decrease  condition  : 


aredk(pk)  >  TL  c3  II  gk  II  min 


(4.23) 
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(ii)  There  exists  c^e  [0,1)  such  that  each  pk  satisfies  II  pk  II  <  (  1+dj  )A*  . 

The  next  theorem  establishes  FOSPC  for  [xk }  satisfying  (i)  and  (ii),  provided  the 
sequence  {xk }  does  not  remain  within  a  certain  ball  about  any  given  iterate  xm . 
Proof  of  this  result  can  essentially  be  found  in  SSB  [1985],  although  the  theorem  was 
not  directly  stated  therein. 

Theorem  4.4  :  Let  f  satisfy  the  standard  assumptions,  and  assume  that  (i)  and  (ii) 
are  true.  If  for  every  m  >0,  either 

(a)  gm  =0,  or 

II  gm  II 

(b)  z{m  >m  such  that  II xm~xm  II  >  — — - —  , 

2  Pi 

then  gk  — >0  (FOSPC). 


Proof  :  First  notice  that  (i)  implies  aredk(pk)> 0,  so  that  {xk  }  satisfies  the 
descent  condition.  Hence  xk  e  L(f  jcf)  for  all  k  >0  and  f  (xk+i)<f  (xk)  for  all  k  >0. 
Now,  consider  any  m  with  II  gm  II  0.  By  assumption,  z$m  >m  such  that  x~  dClm. 
Let  /+1  denote  the  first  such  index  after  m  with  xt+l  not  in  Q.m .  Then 


8n 


2  Pl 


<  II  xM-x„ 


S  flip, 

k=m 


(4.24) 


<  (1  +CTi  )  ^  » 

k=m 


so  that  Lemma  4.3  implies 
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f(xm)  ~f  (xM)  =  £  /(**)“/  (xk+ 1) 


k=m 


*  £  Tit  c3\\gkW  min -j 

k=m 


A  k  > 


H** 


l  i 

^  c3  '^m11  min'! 


k=m 


At  , 


P2 

II** 


(4.25) 


2(32 


Now,  if  <  II  gm  II 12  p2  for  m  <  k  <  l  ,  (4.24)  and  (4.25)  imply 


f(xm)~f(xM)  >  y  r| i  c3  ll&JI  £A* 


k=m 


1  II  II  1  8m _ 

-  2  3)1  C3  8  m  2p1(l+CT1) 


Otherwise, 


We  can  then  write 


f  (xm)  -  f  (xl+l)  >  ri!  c3 


f(xm)-f(xl+ 1)  > 


Til  <^3 
4 


8m 


min  ■< 


Pi  (1+0!)  ’  P2 


(4.26) 


(4.27) 


(4.28) 


Now,  since  /  is  bounded  below  and  {/(**)}  is  nonincreasing,  {/(**)}  converges  to 
some  limit,  say  /(**)->/*.  Thus  for  any  m ,  either  gm  =  0  or 


■Um11 


Tli  c3 
4 


1  J_ 

PiO  +  tJi)  P2 


( f(xm)-f *  )  . 


(4.29) 


Therefore  gk  h>0  ,  which  completes  the  proof.  Q 
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4.3.  Introductory  Examples 

Before  considering  a  general  definition  of  the  algorithm,  let  us  write  two  simple 
examples.  We  follow  the  “minimal  version”  trust  region  Algorithm  A.2.1,  but 
construct  it  for  Example  1.2  of  Chapter  1.  Recall  that  in  this  example  we  sketched  a 
method  (Algorithm  A.  1.5)  that  makes  use  of  a  constant  matrix  for  the  model  Hessian 
when  computation  of  the  exact  Hessian  is  too  expensive  to  be  done  frequently. 


Algorithm  A.4.1  :  Constant  Matrix  versus  an  Exact  Hessian 

0)  Let  Xi  e  IRn,Aj>0  and  a  symmetric  B  €  1R"X"  be  given. 

1)  Repeat  until  convergence  : 

2)  Set  Bk=B. 

3)  Compute  pk  e  A  (A*). 

4)  If  aredk(pk)< 0,  then  set  Bk  =  H (xk ). 

5)  If  then 

If  0  <  1 1  pk  1 1  <  A*  then  set  A*  =  !  I  pk  1 1  /4, 

Else  set  A*  =  A*/4. 

Return  to  3). 

6)  Set  xk+l  =xk  +pk  ,  Ak+X  =  A* . 

7)  If  p k{pk)>~,  then  set  A*+1  =  2A*  . 

8)  End. 


Model  switching  is  done  in  step  4)  by  the  simple  expedient  of  evaluating  H  (xk ) 
whenever  a  test  step  generated  by  the  constant  model  yields  a  function  increase.  This 
represents  just  one  possible  way  of  deciding  whether  to  compute  the  full  Hessian.  In 
the  next  algorithm,  an  addition  is  made  to  this  test  so  that  it  is  only  performed  when 
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the  trust  radius  is  less  than  a  fixed  tolerance.  A  slightly  different  version  of  A.  1.5  is 
used,  so  that  the  most  recently  computed  Hessian  is  always  used  for  Bk . 


Algorithm  A.4.2  :  Exact  Hessian  evaluated  intermittently 

0)  Letxte  IR”,A1>0  and  a  symmetric  Bx  &  JR"*'1  be  given. 

1)  Repeat  until  convergence  : 

2)  Compute  pke  A( A*). 

3)  If  Ak  <  KT6  and  aredk(pk)< 0,  then  set  Bk=H (xk). 

4)  If  P*(P*)<-^>  then 

If  0  <  II  pk  1 1  <  A*  then  set  A*  =  1 1  pk  1 1 14, 

Else  set  Ak  =Ak/4. 

Return  to  2). 

5)  Set  xk+x -xk  +pk ,  A*+1  =  A*. 

6)  If  PkiPk)^^  then  set  A*+1=2  Ak  . 

7)  End. 


This  implementation  refreshes  the  constant  matrix  whenever  the  trust  radius 
drops  beneath  an  arbitrary  lower  bound  and  a  computed  step  does  not  decrease  the 
function  value.  Any  other  way  of  model  switching  or  selection  could  also  be 
implemented  without  affecting  the  theory  of  this  chapter,  such  as  refreshing  every  m- 
iteration,  refreshing  when  H(xk)  is  positive  definite,  or  refreshing  when  Hg*ll  is 
smaller  than  a  certain  value. 

These  two  sample  algorithms  are  probably  only  useful  for  unusual  classes  of 
problems  involving  special  structure  or  extremely  expensive  Hessian  or  gradient 
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evaluations.  In  actual  implementation,  the  precise  application  and  any  knowledge 
about  the  relative  computational  costs  should  suggest  a  particular  technique  for 
choosing  a  model.  If  not,  we  suggest  refreshing  Bk  whenever  one  of  the  following 
holds. 

(1)  The  trust  radius  drops  to  less  than  a  fraction  of  the  largest  trust  radius  used  since 
the  last  instance  of  refreshing  Bk. 

(2)  An  iterate  xk  seems  to  be  near  enough  to  the  solution  so  that  Newton’s  method 
is  quadratically  convergent  without  step  size  control. 

One  should  also  include  a  more  sophisticated  trust  region  update  strategy.  Such 
a  strategy  should  include  the  capability  of  rapidly  increasing  A*  after  refreshing  Bk . 

4.4.  The  Basic  Algorithm  Framework 

We  first  define  a  general  framework  for  a  model  switching  algorithm.  Most  details 
are  initially  unspecified  to  emphasize  the  points  at  which  choices  can  be  made  in 
designing  an  implementation.  We  document  exactly  what  choices  are  allowed  in 


Algorithms  AS.4.4  and  AS. 4.6. 


Algorithm  A.4.3  :  General  Model  Switching 


0)  Let  k  =  \,  xj  e  IR"  ,  a  trial  trust  radius  Aj>0  ,  a  set  of  models  indexed  by 
i  =  1,2,  ...  , N i  ,  and  an  initial  trial  model  with  index  i  all  be  given. 

1)  Repeat  until  convergence  : 

2)  Compute  an  acceptable  step  pk  as  follows  : 

Either 

2.1)  Find  a  trust  radius  A*  >  A*,  a  model  with  index  i ,  and  a  step  pk 

for  which  pk  =  pk  e  A  (/  ,A* ), 

Or  _ 

2.2)  Compute  a  trial  step  pk  e  A(i,Ak).  Then  either 

(a)  Accept  pk  =  pk,  Ak  =  Ak,  and  go  to  3)  ,  or 

(b)  Proceed  to  2.3)  for  trust  radius  reduction  ,  or 

(c)  Choose  a  different  trial  model  and  repeat  2.2). 

2.3)  Find  a  trust  radius  A*  <  A* ,  a  model  with  index  i ,  and  a  step  pk 

for  which  pk  =  plk  €  A  (i  ,Ak ). 

3)  Choose  a  trial  radius  A*+1  for  the  next  iteration. 

4)  Perform  any  calculations  required  to  make  the  models 

{  pred[+l  }  ,  j=\,2,...JVk+l  available  at  the  next  iteration. 

Select  an  initial  trial  model  with  index  i  for  the  next  iteration. 

Set xk+i  =xk  +pk  ,fk+1  =fk  ,  gk+l  =  g(xk+l )  ,k=k  +  1. 


5)  End. 


In  this  most  basic  form  of  the  algorithm,  we  have  not  specified  several  things. 

(1)  How  to  choose  among  the  options  in  step  2). 

(2)  How  to  implement  whichever  one  we  choose. 
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(3)  How  to  choose  a  trial  radius  for  the  next  iteration. 

The  allowable  methods  for  doing  these  steps  differ  depending  on  what  is  known 
about  the  models.  For  standard  quadratic  models  with  uniformly  bounded  curvature. 
Section  4.5  shows  the  choices  allowed  in  the  logic  for  deciding  which  option  to 
execute  in  step  2),  and  Section  4.6  presents  the  allowable  logic  for  step  3). 

Typical  values  for  the  parameters  used  throughout  the  rest  of  this  thesis  can  be 
found  in  Appendix  A. 

4.5.  Computation  of  an  Acceptable  Step 

Step  2)  in  A.4.3  is  usually  referred  to  as  the  “Internal  Loop”  section  of  the 
algorithm.  The  bulk  of  model  switching  (as  opposed  to  model  selection)  logic  must 
lie  in  this  section.  Our  theory  does  not  put  any  restrictions  on  the  initial  choice  of  a 
model,  nor  does  it  require  2.2)  to  be  tried  before  2.1).  The  following  specifications 
are  sufficient  to  define  the  range  of  allowable  options  in  step  2). 
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Algorithm  Specification  AS.4.4  :  Permissible  Choices  in  Step  2)  at  the  k— 
Iteration 

Let  0<r|1  <t|2<  1  ,  a  trial  trust  radius  A*  >0  ,  and  a  set  of  Nk  models  be  given. 

2.1)  We  are  allowed  to  choose  A*  >A*  and  find  pi  for  some  model  with  index  i 

satisfying  both  pk  =  pi  e  A  (i  ,A* )  and  p*(pi)  >ni  whenever  such  a  A*  ,  model 
,  and  corresponding  step  can  be  found. 

2.2)  We  are  allowed  to  accept  A*  =  A*  and  find  pi  for  some  model  with  index  i 

satisfying  both  pk  =  p‘k  e  A  (i  ,A* )  and  pl(pi)  >  r| ,  whenever  such  a  A*  ,  model 
,  and  corresponding  step  can  be  found. 

If  at  least  one  model  with  index  j  and  associated  trial  step  pie  A  (j ,Ak)  computed 
in  phase  2.2)  for  the  trial  trust  radius  a*  satisfies  p {(p{)  <  q2,  then  we  are 
allowed  to  execute  step  2.3). 

2.3)  Find  some  A*e(0.A*),  model  with  index  /,  and  pk  =  pi  e  A  (/  .A* )  satisfying 

P iipi)  s  hi- 


We  now  consider  this  algorithm  in  more  detail. 

4.5.1.  Step  2.1  :  Internal  Doubling 

No  requirements  are  made  on  model  switching  at  this  stage.  The  initial  model  can  be 
retained  or  any  new  model  can  be  chosen,  as  long  as  the  final  choice  of  model,  step, 
and  trust  radius  satisfy  both  pk  =  p‘k  e  A  (i  ,Ak )  and  p  lk{pk)  >  r|}  . 

Step  2.1)  is  usually  referred  to  as  “internal  doubling”  because  a  typical 
implementation  successively  doubles  Ak  as  long  as  p/(p/)>T|3  for  some  r)3  e  (r|2, 1 ). 
Another  option  this  step  allows  is  that  of  always  trying  a  full  quasi-Newton  step 


before  computing  a  constrained  step. 
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Many  implementations  do  not  include  steps  of  this  type,  and  most  of  the 
algorithms  we  present  here  specify  that  such  “internal  doubling”  steps  are  optional. 
However,  this  step  makes  many  line  search  procedures  admissible  to  our  theory. 
Furthermore,  inclusion  of  such  logic  is  particularly  appropriate  for  many  model 
switching  applications.  Consider  a  secant  algorithm2  that  recomputes  the  exact 
Hessian  whenever  the  secant  approximation  leads  to  a  suspiciously  small  trust  radius. 
If  refreshing  the  secant  model  with  the  exact  Hessian  does  indeed  improve  the  model 
significantly,  we  would  expect  that  a  much  larger  trust  radius  can  be  successfully 
used.  Internal  doubling  allows  the  trust  radius  to  increase  immediately  by  a  sequence 
of  minor  iterations3  thus  avoiding  the  extra  gradient  evaluations  associated  with  a 
sequence  of  major  iterations. 

4.5.2.  Step  2.2  :  Acceptance  of  trial  trust  radius 

If  step  2.1)  is  not  executed,  step  2.2)  must  be  tried  before  2.3).  Again  there  are  no 
restrictions  on  how  many  or  how  few  of  the  models  we  use  to  compute  trial  steps.  A 
step  is  first  computed  using  the  initial  trial  model  and  trust  radius.  If  this  step 
satisfies  p ^(p^)  >  r) L ,  then  it  can  be  accepted  and  the  algorithm  can  proceed  to  step 
3).  Other  options  are  allowable,  however.  The  algorithm  can  try  switching  models  to 
see  whether  others  perform  better.  It  can  revert  to  step  2.1)  if  a  sufficiently  good 
model  can  be  found. 

2  Such  as  the  MINPACK  algorithm  HYBRID  of  Mor<5,  Garbow,  and  Hillstrom  [1980]. 

3  Each  minor  iteration  involves  only  linear  algebra  and  a  function  evaluation. 
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4.5.3.  Step  2.3  :  Internal  decrease  of  trust  radius 

The  only  real  restriction  on  the  logic  of  AS. 4.4  deals  with  the  cases  for  which  step 
2.3)  can  and  cannot  be  executed.  It  can  be  tried  if  in  the  course  of  step  2.2)  any 
model  is  found  for  which  p j,(pj)  <r\2.  It  cannot  be  tried  if,  so  far,  every  plk  tried  in 
step  2.2)  satisfies  p lk(plk)  >  % .  The  only  circumstance  where  it  must  be  tried  is  when 
every  available  model  has  been  used  to  compute  a  step  in  step  2.2)  and  for  each 
one  p liPk)  <  Tli- 

These  conditions  are  the  most  general  ones  allowed  by  our  theory  and  are  much 
more  relaxed  than  most  implementations  allow.  In  practice  an  algorithm  should  try  to 
avoid  executing  2.3)  whenever  possible,  since  smaller  trust  regions  slow  convergence. 
On  the  other  hand,  excessive  switching  of  models  in  step  2.2)  leads  to  extra  overhead 
and  function  evaluations.  Any  given  implementation  is  a  tradeoff  between  these  two 
considerations.  Typically,  the  trust  radius  is  not  reduced  if  the  initial  trial  model  and 
step  satisfies  p*(p*)  -  Tib  and  no  alternative  model  is  tried  if  p*(p*)  -  *12- 

If  step  2.3)  is  executed,  it  must  contain  a  procedure  for  actually  calculating  a 
reduced  trust  radius  and  step.  The  following  is  a  version  of  the  backtracking 
procedure  typically  used  in  single  model  trust  region  algorithms.  This  procedure  is 
what  gives  Algorithm  AS.4.4  the  name  “internal  loop.” 
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Procedure  A.4.5  :  Internal  Decrease  of  Trust  Radius 

Let  0<y1<l  ,  r^e  (0,1),  a  trial  trust  radius  A*,  and  a  set  of  Nk  models  all  be 
given. 

Initialize  A*  €  (0 ,  Ak  ]. 

2.3.1)  Setts  (0,Yj]. 

2.3.2)  Set  Ak=xAk. 

2.3.3)  Choose  a  model  with  index  i  and  compute  pk  e  A(i,Ak). 

If  pk(pk)  >  r\i  then  accept  pk  as  pk  and  proceed  to  step  3). 

If  not,  either  repeat  this  step  for  another  model  (one  that  has  not  yet  been 
used  for  this  A*),  or  return  to  step  2.3.1). 


Some  very  important  points  should  be  made  about  this  procedure.  First,  AS.4.3 
allows  this  procedure  to  start  with  at  Ak  =  \\  pkW  if  Wpk  II  e  (0,  A*  ].  Such  an 
initialization  is  often  done  in  practice  when  the  step  computed  in  phase  2.2)  is  a  full 
quasi-Newton  step  in  the  interior  of  the  trust  region. 

Second,  let  us  consider  the  possibility  of  switching  model  preference  during 
Procedure  A.4.5.  This  is  an  important  practical  point.  It  was  found  in  the 
development  of  NL2SOL  (DGW  [1981])  that  performing  model  switching  during 
internal  reduction  significantly  degraded  performance,  in  that  it  added  excessive 
overhead.  It  is  thus  important  that  our  theory  cover  algorithms  that  do  not  perform 
switching  at  this  time.  On  the  other  hand,  a  reasonable  implementation  might  be  to 
try  such  switching  if  a  large  number  of  internal  reductions  are  done  in  a  row  (as  is 
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done  in  Dennis,Sheng,Vu  [1985]).  Our  theory  must  cover  this  case  also.  Therefore 
Algorithm  AS. 4.4  neither  requires  nor  prohibits  switching  of  models  within  the  loop 
to  decrease  A* . 

Third,  for  a  set  of  quadratic  models  in  standard  form,  this  procedure  terminates 
in  a  finite  number  of  iterations.  Consider  the  following  conditions  defining  an 
“acceptable”  step  with  respect  to  a  model  with  Hessian  Bk. 

(A)  pj.eA(i,Ak)  implies  that,  for  some  c3e  (0,1),  p k  satisfies 

f  II  gk  II  ] 

predlk  ( pj )  >  c3  II  gk  II  minj  Ak  ,  y  |  • 

(B)  (i ,Ak )  implies  II  pj  II  <  (  1  +  )  A*  for  some  CTj  e  [0, 1 )  independent  of 

k. 

We  then  have  the  following. 

Theorem  4.5  :  For  a  given  jtjelR",  let  f  :  IRn  -^IR1  satisfy  the  standard 
assumptions.  For  a  given  k>  1,  let  xk  satisfy  xk  &  L(f  yc \),  and  let 
{ Blk  }  ,/=l,2,  ...  Nk  be  the  set  of  Hessians  ofNk  quadratic  models  in  standard 
form. 

If  conditions  (A)  and  (B)  above  hold,  then  Procedure  A. 4. 5  terminates  in  a 
finite  number  of  steps. 

Proof  :  From  Lemma  4.2,  there  exists  a  5  >  0  such  that  Ak<&  and  pke  A  ( i  ,Ak ) 
imply  p|(pi)  >  ri!  for  i  =  1,2,  ...,  Nk.  Now,  if  6  >  A*  at  the  start  of  the  procedure, 
p/(p/)>T|1  for  every  i.  If  not,  then  at  most  logC^-VlogCYi)  applications  of  2.3.1)  are 
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needed  to  imply  pl(pj)>T]x  for  every  i,  and  at  most  Nk  executions  of  2.3.3)  can  be 
performed  per  application  of  2.3.1).  Since  <  1,  this  implies  our  result.  Q 

This  result  cannot  be  taken  for  granted  in  algorithms  that  allow  for  more  general 
models.  Any  alternative  to  Procedure  A.4.5  is  acceptable  as  long  as  it  obtains  a  final 
result  that  satisfies  0  <  A*  <  A k,  pk  =plk  e  A  (i, Ak)  and  pk(pk)  >  rjj  for  some  model 
with  index  i. 

4.6.  “External”  Modification  of  the  Trust  Radius 

We  now  define  the  allowable  logic  for  step  3)  of  A.4.3. 


Algorithm  Specification  AS.4.6  :  Permissible  Choices  in  Step  3)  at  the  k— 

Iteration 

Let  y3  >  1  ,  0  <  ri2  <  1  be  given. 

3)  Choose  a  new  trial  trust  radius  for  the  next  iteration  : 

3.1)  Do  one  of  the  following  : 

3.1.1)  Choose  an  increase  factor  x  e  [l,y3], 

3.1.2)  Or,  if  phase  2.1)  was  not  executed  ,  and  if  for  some  model  and 
associated  step  pi e  A(J,Ak )  from  phase  2.2)  or  p{  e  A  (/.A*)  from 
phase  2.3)  we  have  plipl )  <  r|2,  then  we  are  allowed  to  choose  a 
decrease  factor  x  e  (0,1]. 

3.2)  Do  one  of  the  following  : 

3.2.1  )  Set  the  new  trial  radius,  A*+1,  to  xA* 

3.2.2  )  If,  and  only  if,  pk  satisfies  the  inexact  quasi-Newton  condition 

for  its  associated  model  with  Hessian  Bk,  then  we  are  allowed  to 
set  A*+1  to  x  1 1  1 1  • 


Proceed. 
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Step  3)  also  allows  a  very  broad  range  of  actions.  We  are  always  allowed  to  set 
the  new  trial  trust  radius  to  be  larger  (up  to  a  constant  factor)  than  the  current  radius. 
There  are  two  instances  when  it  can  be  chosen  smaller  than  the  current  value.  If 
P/(p/)<rl2  f°r  some  step  Pi  computed  in  2.2),  then  the  next  trial  radius  can  be  set  to 
any  value  greater  than  0.  Also,  if  the  step  finally  chosen  satisfies  the  inexact  quasi- 
Newton  condition  (4.6b),  then  we  can  set  the  new  trial  trust  radius  so  that 
A*+1e  t  II  Pk  II  ’73 II  Pk  II  ]•  This  could  be  less  than  the  current  radius  A*  since 
\\pk  II  may  be  less  than  A*. 

Finally,  the  new  trial  radius  can  be  set  to  any  value  greater  than  zero  if  any  step 
we  calculated  using  some  model  with  index  j  at  any  stage  of  step  2.2)  or  2.3)  gave 
p i(p{)  <  r\2  unless  the  actual  step  pi  accepted  came  from  2.1). 

4.7,  Global  Convergence  Theory 

We  now  state  the  main  theorem  of  this  chapter.  The  proof  of  this  theorem  is  similar 
to  a  SSB  [1985]  result  for  single  model  algorithms.  However,  provisions  are  made  to 
allow  inexact  quasi-Newton  steps  as  well  as  exact  ones  to  be  used  for  trust  radius 
modification.  Also,  our  proof  does  not  impose  any  conditions  on  how  the  algorithm 
executes  an  internal  or  external  trust  radius  reduction,  so  long  as  this  reduction  is 
allowed.  This,  combined  with  the  fact  that  it  applies  to  multiple  models,  makes  it  a 
significant  generalization  of  existing  theory. 


Consider  the  following  conditions. 
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(A)  Every  model  is  a  quadratic  in  standard  form. 

(B)  pkeA(i, Ak)  implies  that,  for  some  Cje  (0,1)  independent  of  k ,  pk  satisfies 

f  II  ft  H  1 

predlk  iplk)  >  c3  II  ft  II  minj  A*  ,  — y  |  • 

(C)  plkeA  (i  ,Ak )  implies  II  II  <  (  1  +  ot  )  At  for  some  ax  €  [0 , 1 )  independent  of 

k. 

(D)  There  exists  (32>0  independent  of  k  such  that  II  Bk  II  <P2  for  k- 1,2,...  and 
i=\,2,...,Nk . 

We  then  have  the  following. 

Theorem  4.6  :  Let  x  t  €  R"  and  f  satisfy  the  standard  assumptions.  Suppose  that  an 
algorithm  implemented  consistently  with  A. 4. 3,  AS. 4. 4,  and  AS. 4. 6  is  applied 
to  f  starting  from  x  and  that  the  conditions  (A),  <B),  and  (C)  hold.  Let 
x  [  e  ir"  and  f  satisfy  the  standard  assumptions.  Then  the  algorithm 
generates  a  sequence  {xk  }  which  has  the  descent  property. 

Furthermore,  if  (D)  holds,  then  {gk  }  converges  to  zero. 

Proof  :  This  proof  is  in  three  parts.  In  (1)  we  establish  the  existence  of  [xk } 
and  the  descent  of  {/(**)}.  In  (2)  we  assume  that  {xk )  remains  in  (defined  in 
(4.19))  for  all  k  >m  and  show  that  this  leads  to  a  contradiction.  In  (3)  we  prove  the 
main  result  by  appealing  to  Theorem  4.4. 
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(1)  We  first  show  that  a  sequence  {pk  }  satisfying  the  conditions  of  the  algorithm 
exists  and  has  the  descent  property.  Since  Nk<°°  the  algorithm  will  eventually  either 
find  a  satisfactory  step  pk  with  A*  >A*,  or  proceed  to  step  2.3)  for  trust  radius 
reduction.  In  the  latter  case,  consider  any  k  >  1  for  which  xk  e  L(f  >x  t).  Since  every 
model  is  a  standard  quadratic,  Lemma  4.2  establishes  that  there  exists  a  8>0  for 
which  A*  <5  and  p[  e  A(i,Ak)  imply  p*(p*)>rh-  Thus  for  A*  sufficiently  small,  the 
algorithm  generates  a  satisfactory  step  pk . 

Furthermore,  any  step  pk  =pk  accepted  by  the  algorithm  must  satisfy  the 
condition  p j(pk)  >  T\y,  which  with  (B)  implies  aredk(pk)> 0.  But  x  { e  L  (f  jc  t),  so 
that,  by  induction,  {jt* }  exists  and  satisfies  the  descent  condition. 


(2)  Let  Qm  be  defined  as  in  (4.19)  and  consider  any  m  >  1  with  II  gm  II  *  0.  We  want 
to  show  that  some  later  iterate  is  not  in  Qm .  First,  we  will  establish  some  properties 
for  any  xk  e  Q.m . 

Lemma  4.3  implies 


i**ii  s-LisJ1  • 


(4.30) 


so  that  from  (B)  and  (D),  we  have 


predk (pt )>  c 3  !l&  II  min  -s  A*  , 


1 


Pi 

HftJI 


^  yc3  llgmll  min j  A*  , 


(4.31) 


From  (4.31),  (A),  and  Lemma  2.6  we  have 
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1-P*(P*)I  = 


aredk (pj)  -  predk(pk) 
pred‘k(pk) 


-H(xk  +^))(P*)(  1  - 
o  _ _ _ 

i  M  M  .  f  *  'Um11  ' 

I0-'1'*”11  H  A‘ 


(4.32) 


Including  (C),  (D),  and  (SA3)  in  (4.32)  yields 


l -pi  (Pi  ) 1  ^ 


( 1  +  a< )-  A^2  (  Pi  +  P2  ) 


c3  HgmH  mins  A*  , 


(4.33) 


Let  the  scalars  and  5^  be  defined  as 


Hjgm11  ^  3  (  1-T12  ) 

4Y3(l+a!)3  P1+P2 


(4.34) 


K  =  ( 1  -t-CT!  )y3  5m 
_  C3  HgmH  1-112 
4(l+a!)2  Pi+P2 


(4.35) 


Consider  any  A*  <8*.  Since  c3<l,  from  (4.35)  we  have  A*  <  — ^  ,  so  (4.33) 

implies 


l  "Pi  (Pi)  I  < 


(  1  +  CTj  )2  A*  (  Pj  +  P2  ) 

<  c3ll*JI 

<  (1|?a'),2-  <  Pi  +  Pa )  K 

C3 


(4.36) 
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Hence  from  (4.34),  for  any  A^  <  8£  and  xk  e  Clm , 

I  i -pi (pi) i  <  "t ( 1  - ^2 ) <  i-n2  (4.37) 

4 

and 

2-T)2>Pi(pi)>Tl2  ■  (4.38) 

Now,  suppose  that  for  some  m,  llgmll  *0  and  xk  e  Qm  for  all  k  >m  .  We 
establish  that  this  assumption  leads  to  a  contradiction. 

(2a)  By  supposition,  II  gm  II  *0  and  xk  e  Qm  for  all  k  >m  .  Suppose  further  that  lim 

k  -> 

sup  A*  <8m.  Let  k  be  the  smallest  integer  for  which  Ak  <5m  for  all  k  >k.  Let 
be  defined  as  in  (4.6b).  Assume  without  loss  of  generality  that  k  >  m  and  that  if 
option  3.2.2)  is  being  used  in  AS.4.6,  then  k  is  sufficiently  large  to  ensure  C*  <  y  f°r 
all  k  >  k . 

For  any  A*  <8m,  AS.4.6  and  (C)  imply 

A*  <  y3  max  {  A*_j ,  II  p*_1  II  }  <  y3  (  1  +  )  A*,!  (4.39) 

<  y3  (  l  +  Oj  )  8m  <  8+  . 

Hence  for  every  k  >k,  (4.38)  implies  that  any  pk  s  A(i,Ak)  computed  in  phase  2.2) 
satisfies  pk(pk)  >  ri2  -  so  that  phase  2.3)  cannot  be  executed.  Therefore  A*  >Ak  for 
every  k  >k . 

Now  consider  the  computation  of  A*+1  as  specified  by  AS.4.6.  We  will  show 
that  if  xk  e  Qm,  k  >k,  then  Ak+[>Ak.  Since  Ak<,8m,  (4.38)  and  the  definition  of  the 
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algorithm  imply  x>  1.  If  the  trial  trust  radius  for  the  next  iteration  is  computed  by 
phase  3.2.1)  of  AS.4.6,  then 

A*+1  =  tA*  >  Ak  .  (4.40) 

If  is  a  quasi-Newton  step  or  an  inexact  quasi-Newton  step  for  some  model,  we 
have  from  (4.6b) 

ii  s;  ii  ii  pi  ii  >  ii  s;P;  11  =  n  gk  -(b;p;  +  *,)  m 

>11  gk  II  -II  Bipi+gl  II  (4.41) 

2  (  K*  )  II  gk  II 


and 


Pk 


(  1  ~  C*  )  11  Jgjfe 
II  BL II 


8k 


2|32  "  4  P2 


(4.42) 


>  5m  >  A* 


If  the  trial  trust  radius  for  the  next  iteration  is  computed  by  phase  3.2.2)  of  AS.4.6, 
(4.42)  implies 

A*+1=tll^H  (4.43) 

so  that  in  either  event  A*+1  >  A*  for  every  k  >k. 

Let  8"  aAf+1.  Since  Ak  >Ak  and  Ak  >Ak_{  for  all  k  >k , 

lim  inf  Ak  >  5“  >  0  .  (4.44) 


Hence,  if  our  supposition  is  true,  either 
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8m  >  lim  sup  A*  >  lim  inf  A*  >  8m  >  0 

k  — »  M  k  — >  =o 


(4.45) 


lim  sup  A*  >  8m  >  0. 

&  -4  » 


(4.46) 


In  either  event,  lim  supA*  >  0. 


(2b)  By  supposition,  II  gm  II  *0  and  xk  e  Qm  for  all  k  >m  .  Conditions  (B),  (D)  and 
(4.31)  imply 


fk  -fk+i  =  aredk{plk)  >  m  predlk(p'k) 


*  y  Bi  C3  5m  nrinj  A* 


(4.47) 


By  the  standard  assumptions,  /  is  bounded  below  so  that  the  descent  condition 
implies  lim  (fk  -/*+1  )  =  0.  Therefore  lim  A*  =  0,  which  c onrradicts  (2a). 

k  k  —>°° 

Hence,  for  every  m  with  II  gm  II  *0,  there  exists  m>m  for  which  xm  &Qm- 


(3)  Part  (1)  showed  that  {** }  remains  inside  Lif  pc  j),  and  Part  (2)  implied  that  for 
every  m  with  II  gm  II  * 0,  there  exists  m  > m  for  which  x-4Q.m.  Hence,  eventually 
some  x-  must  satisfy 


X  —  X  >  • 


'I  5m  I* 


(4.48) 


Therefore,  Theorem  4.4  implies  that  {g*  }  converges  to  zero.  Q 


Corollary  4.7  :  Let  x  { €  IR"  and  f  satisfy  the  standard  assumptions.  If  conditions 
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(A),  (B),  and  (C)  hold,  then  the  sequences  {.t* }  generated  by  A.4.1  and  A.4.2 
starting  from  x  {  exist  and  are  FOSPC. 

Proof  :  The  standard  assumptions  on  /  and  the  definitions  of  A.4.1  and  A.4.2 
imply  that  each  model  satisfies  conditions  (A)  and  (D).  Existence  and  FOSPC  of 
[xk }  then  follow  directly  from  application  of  Theorem  4.5  and  Theorem  4.6  to  A.4.1 
and  A.4.2.  Q 

4.8.  Other  Allowable  Options 

We  also  present  a  variation  of  AS.4.6  that  has  one  more  allowable  option  at  the  cost 
of  one  more  restriction.  Both  of  these  different  details  are  commonly  found  in  trust 


region  implementations. 
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Algorithm  Specification  AS.4.7  :  Permissible  Choices  in  Step  3)  at  the  k!^ 
Iteration 

Let  0  <  r)i  <  T"h  <  H3  <C  e  [  0,1),  and  l  <  - - —  ^  Yi  -  Y3- 

l  - 

3)  Choose  a  new  trial  trust  radius  for  the  next  iteration  : 

3.1)  If  pj{pj)  >  tl3  for  the  Pk  selected  in  step  2.2)  then  we  are  required  to 

choose  t  €  [Yi-YiK 

Else 

3.1.1)  Either  choose  an  increase  factor  x  €  [1,y3], 

3.1.2)  Or,  if  phase  2.1)  was  not  executed  ,  and  if  for  some  model  and 
associated  step  pi  e  A  (J  ,Ak  )  from  phase  2.2)  or  p[  e  A  O', A*)  from 
phase  2.3)  we  have  p [(pi)  <  r|2,  then  we  are  allowed  to  choose  a 
decrease  factor  x  e  (0,1]. 

3.2)  Set  the  new  trial  radius  to  either  xAt  or  xll  pk  II . 

Proceed. 


The  logic  in  3.1)  forces  the  use  of  an  increase  factor  x  under  certain  conditions. 
Because  of  this,  the  new  trial  trust  radius  can  be  based  on  either  the  length  of  the  last 
step  or  the  size  of  the  last  trust  radius.  In  AS. 4.6,  care  must  be  taken  if  the  norm  of 
the  current  step  is  used  to  calculate  the  next  trial  radius  because  if  pk  lies  on  the 
interior  of  the  trust  region,  the  trust  region  may  be  reduced  even  if  no  decrease  factor 
was  applied. 

Consider  the  following  conditions.  Only  the  condition  (C)  differs  from  the 
assumptions  of  the  previous  theory. 

(A)  Every  model  is  a  quadratic  in  standard  form. 
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(B)  p/  e  A  d  At )  implies  that,  for  some  c3e  (0,1),  pk  satisfies 

r  ii  gk 

predlk  (p‘k)  >  c ,  II  II  rrunj  A*  , 

(C)  pj.  6  A(iAk)  implies  condition  (4.7  a  or  b)  or  (4.8  a  or  b)  for  some  a, €  [0,1) 
independent  of  k . 

(D)  There  exists  (32>0  independent  of  k  such  that  II  Bk  II  <(32  for  £  =  1,2,  ... 

,  /  =1,2,  •  •  •  JV*. 

Theorem  4,8  :  Let  x  { e  IR”  and  f  satisfy  the  standard  assumptions.  Suppose  that  an 
algorithm  implemented  consistently  with  A.4.3,  ASA  A,  and  AS.4.7  is  applied 
to  f  starting  from  x ,,  and  that  the  conditions  (A),  (B),  and  (C)  hold.  Then 
the  algorithm  generates  a  sequence  {xk  }  which  has  the  descent  property. 
Furthermore,  if  (D)  holds,  then  [gk }  converges  to  zero. 

Proof  :  The  proof  is  almost  the  same  as  that  for  Theorem  4.6.  AS.4.7  differs 
ffom  AS. 4. 6  only  in  the  trust  radius  modification  logic,  and  hence  only  section  (2)  of 
the  proof  need  be  modified.  We  need  only  make  the  following  changes. 

(1)  Substitute  ri3  for  ti2  throughout  the  proof. 

(2)  In  the  third  paragraph  of  (2a),  notice  that  since  Ak<8m,  (4.38)  implies  that 
pk(pk)>r\3.  Therefore.  AS.4.7  requires  the  selection  of  a  x  satisfying 


T  >  y2  >  (  1  -  CTi  )  1 


(4.49) 
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Now,  AS. 4.7  always  allows  A*+l,  to  be  defined  by  ill  pj,  II  as  in  (4.43),  but  (C) 
implies  that  p‘k  either  satisfies  the  inexact  quasi-Newton  condition,  or 


II  p^  II  >(  1-o^A* 

In  either  event,  (4.42),  (4.49),  and  (4.50)  imply 


(4.50) 


(4.51) 


A*+i  =  til  plk  II  >  Y2(  1  -CTt)A k 

for  every  k  >k;  hence  (4.43)  through  (4.46)  are  true  for  AS.4.7  as  well  as  AS. 4.6. 
This  establishes  the  result.  [J 


4.9.  Summary  :  Type  One  Algorithms 


Definition  4.3  :  We  refer  to  any  algorithm  of  the  form  A.4.3  implemented 
consistently  AS. 4.4  and  AS. 4.6  or  AS.4.7  as  a  type  one  algorithm. 


This  class  of  algorithms  puts  no  limitations  on  how  model  switching  or  selection 
can  be  done,  and  puts  few  limitations  on  the  permissible  range  of  trust  region  logic. 
However,  each  of  the  models  must  be  quadratic  with 

lim  sup  II  Blk  II  <  °°  . 
k  — 

It  should  be  noted  that  the  number  of  models  available  at  any  given  iteration  is 
not  required  to  be  constant.  In  fact,  any  finite  number  of  models  could  be  used.  For 
example,  consider  the  model  generation  scheme  of  Nazareth  [1980,1983]  that  selects 
a  particular  convex  combination  of  two  model  Hessians.  A  modification  could  be 
made  to  Nazareth’s  algorithm  so  that  it  would  consider  more  than  one  possible 
convex  combination.  Our  theory  allows  the  algorithm  to  draw  any  number  of  model 
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Hessians  from  the  infinite  set  of  possible  models  as  long  as  the  sampling  procedure  is 
finite.  If  there  exists  a  uniform  upper  bound  on  the  curvature  of  each  of  these 
models,  then  the  algorithms  of  this  chapter  are  FOSPC. 


CHAPTER  5 


Safe  Algorithms  using  fewer  restrictions  on  the  various  models 

5.1.  Introduction 

In  this  chapter,  we  assume  that  at  least  one  model  is  a  quadratic  in  standard  form 
with  uniformly  bounded  curvature.  The  other  models  and  associated  trial  steps  are 
only  assumed  to  obey  a  “uniform  predicted  decrease”  condition.  We  present  model 
switching  algorithms  that  are  FOSPC  under  these  assumptions.  These  algorithms  are 
computationally  efficient,  even  if  implemented  with  a  large  number  of  models. 
Furthermore,  no  unnecessary  assumptions  are  made  about  the  alternative  models. 
Specifically,  these  models  need  not  be  quadratic,  and  they  are  not  even  required  to 
generate  steps  which  are  in  descent  directions.  These  algorithms  and  the  theory 
describing  them  thus  represent  a  significant  advance  in  the  understanding  of  this  area. 

5.2.  Preliminary  Definitions 

We  use  much  of  the  same  nomenclature  as  in  Chapter  4.  Standard  quadratic  models 
are  defined  as  in  (4.1).  Now,  however,  predlk{p)  may  refer  to  a  general  model  which 
need  not  be  quadratic.  Jhe  actual  function  reduction  is  defined  by  (4.10),  and  p*(p)  is 
defined  as  in  (4.1 1). 
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Definition  5.1  :  A  set  of  models  {predk},  k- 1,2,...  ,  / —1,2,...  .  Nk  and  associated  set 
of  steps  { plk }  is  said  to  satisfy  the  uniform  predicted  decrease  condition 
(LTD)  if  there  exist  c3e  (0,1)  and  (33e  [0 ,«)  such  that 

f  II  gk  I!  1 

predlk  (plk)  >  c3  II  ft  II  minj  A*  ,  — — —  >  .  (5.1) 

The  UPD  condition  is  a  statement  about  a  sequence  of  models  and  the  associated 
procedures  for  computing  trial  steps.  It  is  not  a  requirement  on  the  function,  or  on 
how  well  the  model  matches  the  function. 

In  Chapter  4,  all  models  were  assumed  to  be  standard  quadratics  with  uniformly 
bounded  curvature.  These  assumptions  were  used  in  two  different  ways  in  our 
proofs.  First,  they  were  sufficient  to  ensure  that  FCD  and  OLC  steps  satisfied  the 
UPD  condition  given  by  equation  (4.5).  They  were  also  sufficient  to  ensure  that 
FOSPC  is  not  prevented  by  spurious  reduction  of  the  trust  radius.  The  algorithms  of 
this  chapter  still  require  all  trial  steps  to  satisfy  the  UPD  condition,  now  denoted  by 
(5.1).  However,  by  linking  the  trust  radius  reduction  logic  with  the  model  switching 
logic,  we  show  that  the  inclusion  of  a  single  standard  quadratic  model  with  uniformly 
bounded  curvature  is  a  sufficient  condition  to  prevent  spurious  trust  radius  reduction. 

For  standard  quadratic  models  and  algorithms  which  use  FCD  or  OLC 
procedures  to  compute  steps,  conditions  weaker  than  uniformly  bounded  curvature  are 
sufficient  to  imply  the  UPD  condition.  In  Chapter  6,  we  will  present  some  of  these 


conditions. 
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An  interesting  point  is  that  no  requirement  is  made  that  pk  be  a  descent 
direction,  as  long  as  the  UPD  condition  is  satisfied.  For  example,  some  of  the  local 
minimizers  of  problem  TRS  in  the  indefinite  case  may  be  in  a  direction  which 
initially  increases  the  function,  yet  which  still  result  in  a  sufficient  amount  of  function 
reduction. 

We  will  make  use  of  the  following  definitions.  First,  recall  the  e-test  from 
Chapter  3.  The  observed  error  for  a  given  model  and  test  step  is  defined  as 

ek(p)  =  I  aredk(p  )  -  predlk(p)  I  .  (5.2) 

The  set  of  indices  to  models  available  at  a  given  iteration  k  is  defined  as 

h  =j;  :  1  £/  £N„\  .  (53) 

Any  arbitrary  partition  of  Ik  will  be  represented  by  two  sets  of  indices  denoted  Ik  and 

A 

lk  which  satisfy 

h  UJ  h  -  h  (5-4) 

and 

h  O  h  ~  empty  (5.5) 

5.3.  Introductory  Examples 

Under  the  assumption  that  using  a  constant  model  Hessian  is  sufficiently  less 
expensive  to  compensate  for  its  slower  convergence,  Algorithm  A. 4.1  of  the  last 
chapter  used  a  constant  matrix  as  the  primary  model  and  the  exact  Hessian  as  a 
backup.  In  addition  to  being  inexpensive  to  work  with,  constant  Hessian 
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approximations  are  natural  candidates  for  safeguarding  other  models. 

Let  model  one  be  some  arbitrary  model  of  interest  which  is  assumed  to  generate 
steps  satisfying  the  UPD  condition.  For  a  constant  symmetric  matrix  B ,  let  model 
two  be 

pred^ip  )  =  -glp  -~p‘Bp  (5.6) 

The  following  algorithm  always  tries  model  one  first,  but  may  change  to  the  constant 

model  •  model  one  exhibits  poor  performance.  The  matrix  B  can  be  something  as 
simple  as  the  identity  matrix. 


Algorithm  A. 5.1  :  Safeguarding  with  a  Constant  Model 

0)  Let  s  IRn  ,A[  >0.  and  two  models  be  given. 

1 )  Repeat  until  convergence  : 

2)  Compute  pk  s  A  ( 1  .A* ). 

3)  If  p t(Pk  K0.01,  then 

Compute  pk  e  A  (2.A* ). 

If  Pi2 (PC ) <0.01,  then 

Set  A*  =A*  /4  and  return  to  2), 

Else  set  pk  =pk2,  p*  =  p*  (p*2 ), 

Else 

Set  pk  =pkK  pk  =p£{Pk  )• 

End  if. 

4)  Set  A*  +  l  ~Ak . 

If  p^  >0.5  set  Ak+l  =  2Ak. 

5)  Set  xk+l  =xk  +pk  and  define  model  1  for  the  next  iteration. 

6)  End 


Rather  than  defining  the  constant  model  to  be  the  primary  model,  this  algorithm 
always  uses  it  as  the  backup.  The  backup  model  is  now  assigned  the  function  of 
making  the  algorithm  safe  rather  than  accelerating  it.  This  is  an  example  of  the 
capability  of  multi-model  algorithms  to  safeguard  unproven  methods.  Notice  that  this 
algorithm  will  only  reject  model  one  if  it  is  performing  poorly  already  and  model  two 
looks  better.  A  badly  scaled  choice  of  the  “safe”  model  (such  as  the  identity  matrix) 
will  not  cause  the  performance  degradation  seen  in  Algorithm  A. 3. 2.  Notice  also  that 
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no  requirement  is  made  concerning  whether  or  not  a  failure  of  the  primary  model 
results  in  it  being  “reset"  to  the  value  of  the  backup. 

Algorithm  A. 5.1  has  to  perform  two  function  evaluations  before  any  trust  radius 
reduction  can  be  done.  Since  such  reductions  are  not  uncommon,  we  now  give  a  less 
expensive  algorithm. 


Algorithm  A.5.2  :  Safeguarding  a  method  with  a  Constant  Model 


0)  Let  x  {  e  IR'1  ,  A(  >  0,  and  two  models  be  given. 

1)  Repeat  until  convergence  : 

2)  Compute  pk  e  A  ( 1  ,A* ). 

3)  If  pk  ( pk  )  <  0.0 1  and  ek  (pk  )>  5  ek  ( pk  ),  then 

Compute  pk  e  A  (2,Ak ). 

If  ( P*2 )  <  0.0 1 ,  then 

Set  Ak=Ak  14  and  return  to  2), 

Else  Set  pk  =pk 2,  p^  =p*2(p*2). 

Else, 

Set  pk=pk,  p*=p k(Pk  )• 

End  if. 

4)  Set  Ak+l=Ak. 

If  pA  >0.5  set  Ak+l  =  2Ak. 

5)  Set  jcfc+1  -xk  +pk  and  define  model  1  for  the  next  iteration. 

6)  End 


This  algorithm  is  identical  to  A. 5.1  except  for  the  added  requirement  in  step  3). 
However,  it  can  often  avoid  extra  function  evaluations.  Specifically,  even  if  the 
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primary  model  is  performing  poorly,  we  need  not  compute  and  test  a  step  from  the 
alternative  model  unless  it  does  a  significantly  better  job  of  prediction  for  the  current 
trial  step  than  the  primary  model.  Thus  this  algorithm  is  highly  biased  against  the 
use  of  the  secondary  model,  even  beyond  the  fact  that  the  primary  model  is  always 

cried  first. 

Because  of  the  its  low  additional  overhead,  its  high  bias  toward  using  the 
primary  model,  and  its  insensitivity  to  the  possibility  of  a  poorly  scaled  secondary 
model,  this  algorithm  is  exceedingly  well  suited  to  safeguarding  new  or  unproven 
methods.  Again,  no  requirements  are  made  on  the  primary  model  other  than  the  UPD 
condition.  A  less  simplified  version  of  this  algorithm  is  suggested  for  general  use. 

A  version  of  A.5.2  could  also  be  used  as  a  method  of  deciding  when  to 
"refresh”  a  secant  model  with  an  exact  Hessian  by  using 

predk{p  )  =  -gkP  ~  ~p‘  Hkp  •  To  avoid  evaluating  H  unnecessarily  when  performing 
the  e-test,  predk~(pk)  should  be  approximated  by  finite  differences. 

5.4.  The  Basic  Algorithm  Framework 

We  now  present  algorithms  for  model  switching  when  at  least  one  model  is  a 
standard  quadratic  with  uniformly  bounded  curvature  and  all  of  the  models  satisfy  the 
UPD  condition.  Algorithm  A.4.3  is  still  used  for  the  basic  framework. 
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5.5.  Computation  of  an  Acceptable  Step 

The  following  specifications  are  sufficient  to  define  the  range  of  allowable  options  in 
step  2)  of  A.4.3.  They  differ  from  those  given  for  type  one  algorithms  only  in  their 
criteria  for  allowing  trust  radius  reduction. 


Algorithm  Specification  AS.5.3  :  Permissible  Choices  in  Step  2  of  A.4.3  at 
the  k—  Iteration 

Let  p>0,0<ri1<r|2<^  <a  trial  01181  ra^lus  At>  0  ,  and  a  set  of  Nk  models 
be  given. 

2.1)  We  are  allowed  to  accept  Ak  >  Ak  and  pk  =pk  for  some  model  with  index  i 

whenever  such  a  radius,  model,  and  step  satisfying  both  pk  =pk  s  A(i  Ak) 
and  p k(pk)  >  r|t  can  be  found. 

2.2)  We  are  allowed  to  accept  Ak  =  Ak  and  pk  =plk  for  some  model  with  index  i 

whenever  both  plk&  A  (i  Ak)  and  pkipk)  ^  Hi- 

We  are  allowed  to  go  to  step  2.3)  for  trust  radius  reduction  only  if  one  of  the  fol¬ 
lowing  holds  for  the  trial  steps  computed  in  2.2)  : 

Reduction_Test_0  :  Every  model  with  index  i  e  Ik  and  associated  trial  step 
pke  A  (i  Ak)  for  the  trial  trust  radius  Ak  satisfied  p k{pk)  <  p2  >  or  ^ 

Reduction_Test_l  ^  p lipf)  <  r\2  for  i  e  Ik  ,  and  for  every  j  e  lk  3  at 
least  one  i  6  Ik  such  that  \ie{(pk)  >  ek{pk). 

2.3)  Find  some  Ak  6  ( 0,Ak ) ,  model  with  index  i ,  and  pk  =p[  A  (i  Ak )  satisfying 

pi  (Pk)  ^  Til- 


In  Reduction_Test_0,  lk  is  the  set  of  all  models.  In  an  implementation  of 
Reduction_Test_l,  Ik  refers  to  models  for  which  we  have  already  computed  pk  and 
aredk(pk ),  while  lk  is  the  set  of  models  for  which  we  have  not  done  this 
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computation.  These  two  tests  are  often  easier  to  apply  if  stated  to  specify  when  trust 
radius  reduction  is  not  permissible.  In  Reduction_Test_0,  reduction  is  not  allowed  if 
there  exists  a  step  i  e  Ik  which  satisfies  p*(p*Rr|2.  In  Reduction_Test_l,  reduction 
is  not  allowed  if  either  of  the  following  hold. 

(i)  There  exists  a  step  i  e  Ik  which  satisfies  p i(p*)^n2,  or 

(ii)  If  pk(pk) <Tl2  for  a11  ‘  s  h  >  and  there  exists  j  e  lk  with 

<5-7> 

r 

for  all  i  s  lk . 

We  now  consider  this  algorithm  in  more  detail. 


5.5.1.  Step  2.1:  Internal  Doubling 

This  step  is  specified  identically  with  type  one  algorithms. 

5.5.2.  Step  2.2  :  Acceptance  of  Trial  Trust  Radius 

The  conditions  for  acceptance  of  a  potential  step  pk  are  the  same  as  for  type  one 
algorithms.  Any  step  found  which  satisfies  pkeA(i, Ak)  and  pk(pk)>T\{  can  be 
accepted.  Even  if  such  a  step  has  been  found,  the  algorithm  is  still  allowed  to 
compute  the  test  steps  from  other  models.  However,  the  conditions  under  which  we 
can  reduce  the  trust  radius  by  going  to  step  (2.3)  are  more  restrictive.  Two 
alternative  conditions  are  shown  in  AS. 5. 3  :  Reduction_Test_0  and  Reduction_Test_l. 
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Reduction_Test_0  prohibits  reducing  the  trust  radius  if  any  of  the  available 
models  generates  a  step  which  satisfies  p^(plc)>r\2-  Thus>  >n  a  three  tnodel  system, 
even  if  p/lp/ XT) l  and  p*  tpTKn  i,  Pk  must  still  be  computed  and  tested  before  any 
reduction  is  allowed.  Reduction  is  only  required  if  steps  have  been  computed  for  all 
of  the  models  and  p*(p*)<Tli  f°r  each.  If  Tj[=r|2’  notice  that  the  reduction  is  required 
if  and  only  if  it  is  allowed. 

Algorithm  A. 5.1  is  an  example  of  logic  based  on  this  criteria.  Even  for  this  case 
where  only  two  models  are  used,  we  see  that  Reduction_Test_0  criteria  can  lead  to 
excessive  switching. 

Reduction_Test_l  is  much  more  efficient.  Say  that  we  have  computed  p*  and 
found  p^fp/;1  )<T1 1-  Then  we  are  allowed  to  reduce  the  trust  radius  unless  one  of  the 
other  models  does  a  significantly  better  job  of  predicting  the  actual  function  value. 
The  parameter  p  can  be  adjusted  to  control  the  relative  frequency  between  model 
switching  and  trust  radius  reduction  in  the  algorithm’s  search  for  an  acceptable  step. 

Algorithm  A.5.2  is  an  example  of  logic  based  on  Reduction_Test_l.  Model 
switching  is  suppressed  by  setting  p  to  the  relatively  large  value  of  5. 

Further  discussion  of  condition  Reduction_Test_l  is  given  in  Section  5.7. 
Reduction_Test_0  is  included  here  only  as  an  elementary  example  of 


Reduction  Test  1,  and  is  not  recommended  for  actual  use. 
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5.5.3.  Step  2.3  :  Internal  Decrease  of  the  Trust  Radius 

If  the  loeic  in  step  2.2)  allows  an  internal  decrease,  then  we  can  use  any  procedure  to 
do  so,  as  long  as  the  final  step  chosen  satisfies  pk  =pk  e  A  (/  At)  f°r  some  model  with 
index  i  and  Ak  e  (0,A*  ),  and  No  requirement  is  made  on  how  the 

radius  is  reduced  or  whether  or  not  the  same  model  is  retained.  Therefore  both  the 
NL2SOL  (DGW  [1981])  tactic  of  retaining  the  same  model  throughout  this  phase  and 
the  Dennis,  Sheng,  Vu  [1985]  tactic  of  automatically  switching  after  5  unsuccessful 
decreases  are  admissible. 

For  quadratics  (and  most  other  reasonable  models),  Procedure  A.4.5  is  assured 
of  finding  an  acceptable  step  in  a  finite  number  of  minor  iterations.  This  is  not 
automatically  true  for  all  models.  A  trivial  example  is 

pre4(P)-^~  <5'8) 

Til 

which  satisfies  the  UPD  condition  for  any  p  yet  does  not  satisfy  pk(p)>r}[  as 
||p  II  _>o .  Models  of  this  sort  are  unlikely,  but  if  they  are  included  in  the  algorithm. 
Procedure  A.4.5  will  still  find  an  acceptable  step  in  a  finite  number  of  minor 
iterations  if  it  switches  models  every  time  that  a  fixed  number  of  internal  decrease 
steps  have  been  unsuccessful. 

We  give  the  following  procedure  for  a  two  model  algorithm  with  c  the  index  of 
the  current  model  and -.a  the  index  of  the  alternate.  A  new  parameter  is  introduced: 
the  integer  jc  e  [l,oo)  .  The  function  mod(j ,jc)  is  the  standard  “modulus”  operator 
that  has  value  0  if  and  only  if  j —  jc .  The  portions  of  the  procedure  involving  jc  are 
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included  to  guarantee  successful  termination  of  the  inner  reduction  loop  for  general 
models  and  are  are  not  essential  if  both  models  are  quadratics.  A  suggested  value  for 

jc  is  5- 


Procedure  A.5.4  :  Internal  Decrease  of  the  Trust  Radius 


Let  Y[€  (0.1  ),  r|i  e  (0,11  and  jc  e  [  1 ,~)  be  given. 

For  7  =  1,2,...  do  : 

Set  €  (0 , Yi  A*  ]. 

Compute  pk  €  A(c  Ak ). 

If  p k(Pk  )  -Oi,  then  set  xk+l  -xk  +p£  and  exit. 

If  mod  (j  ,jc )  =  0,  then 

Switch  model  preference  by  switching  the  indices  a  and  c . 
End  if. 

End  for. 


Again,  this  procedure  is  not  required  if  both  models  are  quadratics  in  standard  form. 

5.6.  External  Modification  of  the  Trust  Radius 

The  following  specifications  are  sufficient  to  define  the  range  of  allowable  options  in 


step  3). 
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Algorithm  Specification  AS.5.5  :  Permissible  Choices  in  Step  3)  of  A.4.3  at 
the  k—  Iteration 

Let  y3  >  1  be  given. 

3)  Choose  a  new  trial  trust  radius  for  the  next  iteration  as  follows  : 

3.1)  Do  one  of  the  following  : 

3.1.1)  Choose  an  increase  factor  x  e  [  1  ,y3] , 

3.1.2)  Or,  if  test  Reduction_Test_0  or  Reduction_Test_l  was  satisfied  at  the 
last  point  of  phase  2.2), 

then  we  are  allowed  to  choose  a  decrease  factor  x  e  (0,1]. 

3.2)  Set  the  new  trial  radius  to  xA*  . 


As  in  type  one  algorithms,  a  trust  radius  increase  by  up  to  a  constant  factor  is 
always  allowed.  If  a  trust  radius  decrease  is  allowed,  it  can  be  done  in  any  fashion 
desired.  However,  the  test  allowing  reduction  is  more  stringent,  in  that  plk{plk)<x\2 
for  the  step  selected  may  not  be  a  sufficient  condition  to  allow  decrease  if  ,  for 
example,  p/(p/)>T|2  for  an  alternative  model. 

5.7.  Algorithm  Efficiency  and  the  Parameter  p 

As  stated  previously,  a  multi-model  algorithm  must  not  require  significantly  more 
work  per  iteration  than  a  single  model  algorithm.  The  parameter  p  in 
Reduction_Test_l  allows  us  to  control  the  amount  of  model  switching  in  a  very 
natural  manner.  As  p  — >  the  algorithm  becomes  indistinguishable  from  a  model 
selection  algorithm.  As  p  — *  0,  the  algorithm  becomes  indistinguishable  from  one 
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which  always  tests  all  models  before  allowing  a  trust  region  reduction. 

NL2SOL  uses  p=l.5,  which  was  experimentally  found  to  be  sufficiently  large  to 
prevent  excessive  switching,  but  small  enough  to  keep  the  trust  radius  commensurate 
with  the  ‘'best"  model.1  No  preference  is  given  to  either  model  a  priori  (other  than 
the  hereditary  preference),  but  p.  can  also  be  used  to  bias  toward  a  model  with  a 
lower  overhead. 

Suppressing  switching  by  choosing  |i  to  be  large  is  natural  precisely  because  of 
the  conditions  under  which  it  allows  switching.  A  value  of  100  will  make  model 
switching  extremely  rare,  yet  switch  if  the  performance  of  the  alternative  model  is 
undeniably  superior. 

The  use  of  more  than  three  possible  models  is  somewhat  speculative,  but  could 
occur  in,  say,  an  expert  system  code  for  optimization.  Even  for  a  large  number  of 
models,  this  type  of  algorithm  is  efficient.  Consider  Nk  =  16  and  |i=2.  If  the  initial 
trial  model  yields  PkiPk)<r\2>  then  the  alternatives  are  scanned  for  those  which  are  at 

least  twice  as  good  at  predicting  the  actual  reduction:  el(pkl)  <  —  ekl(pkl).  Say, 

pessimistically,  that  models  2,  3,  ...,  8  all  satisfy  this  condition.  Then  models  9,  10, 

16  can  be  dropped  from  consideration,  and  we  choose  the  remaining  model  with 
the  smallest  value  of  <?i{Pk)  to  compute  the  next  trial  iterate.  Now  we  scan  models 
3,  4,  ...,  8  to  find  any  which  are  twice  as  good  as  model  two  at  predicting  the  actual 

1  This  value  offers  support  to  stochastic  arguments  based  on  the  central  limit  theorem  which  sug¬ 
gest  a  value  of  when  Nk  models  are  available  to  the  system.  However,  there  are  no  computation¬ 
al  results  in  existence  for  N*>2,  so  this  suggestion  should  be  treated  as  speculative. 
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reduction.  Again  being  pessimistic,  models  3,  4  and  5  might  be  accepted,  and 
assuming  ek(pk~)<ek(pp),  ek5(pp ),  we  compute  the  potential  iterate  p *3.  Being 

pessimistic  a  final  time,  suppose  ek{pk)  <  y ek(pk ). 

Now,  it  is  unlikely  that  this  point  would  have  been  reached,  but  suppose  it  has 
been.  Then  model  4  has  predicted  aredk(pk)  twice  as  well  as  model  one,  has 
predicted  aredk(pp)  twice  as  well  as  model  two  (which  itself  was  the  best  predictor 
for  aredk(pk)  ),  and  now  predicts  aredk(pk)  twice  as  well  as  model  three  (which 
itself  predicted  aredk  (pp)  very  well).  Under  such  conditions  an  algorithm  would  be 
unreasonable  if  it  did  not  investigate  model  4  before  reducing  the  trust  radius. 

Thus,  even  for  relatively  large  Nk  and  small  (a,  the  algorithm  should  spend  only 
as  much  effort  in  switching  as  is  justified. 

These  arguments  are  somewhat  hypothetical,  but  if  logic  is  included  in  the 
switching  algorithm  to  increase  (i  to  a  large  value  after  two  models  have 
unsuccessfully  computed  steps,  then  it  should  be  clear  that  excessive  model  switching 
during  an  iteration  is  not  a  valid  concern. 

It  should  not  be  assumed  from  this  discussion  that  (a  should  always  be  chosen  to 
be  a  large  value.  Such  an  approach  is  conservative  with  respect  to  making  the  most 
use  of  the  different  models.  For  any  given  application  an  appropriate  (a  should  be 


found  to  ensure  a  reasonable  tradeoff  between  alternatives. 
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5.8.  Global  Convergence  Theory 

Consider  the  following  conditions. 

(A)  At  every  iteration  k  and  for  every  i  e/j,  pkeA(i,Ak)  implies  that 
predi  {pi)>CT,  llj?*  II  min{  Ak  ,  ll,^ll/(33  )  for  some  c3e  (0,1)  and  |33>0 
independent  of  k . 

(B)  At  every  iteration  k  and  for  every  /  €  Ik,  pk€A(i, Ak)  implies 
WpkW  <  (  1  +  <?[  )  Ak  for  some  at  e  [0.1)  independent  of  k . 

(C)  At  every  iteration  k,^q  e  Ik  such  that  predj/  is  a  quadratic  in  standard  form 
and  II  Bj}  II  <(32  for  some  [32>0  independent  of  k. 

We  then  have  the  following. 

Theorem  5. 1  :  Let  x  {  €  IR"  and  f  satisfy  the  standard  assumptions.  Suppose  that  an 
algorithm  implemented  consistently  with  A.4.3,  AS. 5. 3,  and  AS. 5. 5  is  applied 
to  f  starting  from  x  [t  and  let  conditions  (A),  (B),  and  (C)  hold.  Then  the 
algorithm  generates  a  sequence  [xk }  which  has  the  descent  property. 
Furthermore,  gk  — >  0. 

Proof  :  This  proof  is  in  three  parts.  In  (1)  we  establish  the  existence  of  {**  } 

and  the  descent  of  {/(**)}•  In  (2)  we  assume  that  {xk  }  remains  in  Qm  for  all  k  >m 

and  show  that  this  leads  to  a  contradiction.  In  (3)  we  prove  the  main  result  by 


appealing  to  Theorem  4.4. 
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(1)  We  first  show  that  a  sequence  {xk }  satisfying  the  conditions  of  the  algorithm 
exists  and  has  the  descent  property.  Since  Nk<°°  the  algorithm  will  eventually  either 
find  a  satisfactory  step  pk  with  A*>A*,  or  proceed  to  step  2.3)  for  trust  radius 
reduction.  In  the  latter  case,  consider  any  k  >  l  for  which  xke  L(f  ,*[).  Since  at 
least  one  model  at  every  iteration  is  a  standard  quadratic,  Lemma  4.2  establishes  that 
there  exists  a  8>0  and  index  i  for  which  Ak  <8  and  p k  s  A(ijxk)  imply  p^(p*)>Tli- 


Thus  for  A*  sufficiently  small,  the  algorithm  can  find2  a  satisfactory  step  pk. 

Furthermore,  any  step  pk  =pk  accepted  by  the  algorithm  must  satisfy  the 
condition  p k(pk)  >  T)b  which  with  (A)  implies 


aredk(pk)>  T) j  c3  WgkW  min 


A* 


IlgJI 


>  0 


(5.9) 


But  X  {  €  L  (f  JC  L)  so  that,  by  induction,  {xk  )  exists  and  satisfies  the  descent  condition. 
(2)  Let  Qm  be  defined  as  in  (4.19),  i.e. 

IU  II 

Qm  =  {x  :  II  x  -xm  II  <  and  x  e  L(f  pc  L) }  . 

i  Pi 

Consider  any  m  >  1  with  II  gm  II  ^  0.  We  want  to  show  that  some  later  iterate  xk  is 
not  in  Qm. 

First,  we  will  establish  some  properties  for  any  xk  e  £lm .  Lemma  4.3  implies 


2  Recall  that  we  did  not  specify  how  the  algorithm  would  find  pk  and  i ,  but  merely  stated  in  step 
2.3)  that  they  are  to  be  found.  In  this  theorem  we  show  that  permissible  values  exist,  and  hence  by 
the  statement  of  the  algorithm,  they  will  be  found.  In  practice,  one  would  use  an  procedure  similar  to 
A.5.4,  just  as  one  would  use  Procedure  A.4.5  with  the  algorithms  of  Chapter  4. 
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n«, ii  4"*.11 

so  that  (A)  implies 


(5.10) 


(5.11) 


Let  q  be  the  index  of  a  standard  quadratic  models  satisfying  condition  (C).  From 
Lemma  2.6,  (B),  (C),  and  (SA3) 

4  ( pj  )  =  I  aredk  (plk )  -  predgipj, )  I 

t 

<  \(pl)‘(Bg-H(xk  +W)(W)(  1  ~%)dZ, 

o  (5.12) 
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Hence 


predlk  ( plk  )>  —  c3  \\gm  li  Ak 


\  i-p ji(pi)i 


4  ( Pk ) 

I  pr^  (p^  )  1 


^■4  (Pk  ) 
C3  11  A* 


(5.15) 


(5.16) 


Therefore  for  any  A*  <  8^  and  xk  e  , 


<  tSt  ,p'  +  pj)A‘ 


( 1  +a,  )*• 

<  — n-V1  P. +  fc  >  K 

C  3  Sm  ^ 


(5.17) 


<  y(l-02)<  1-02 

and 

2  — 1)2  >  p^ip^) >  O2  •  (5.18) 

Now  that  we  have  established  these  results  for  any  A*  <  5*  and  xk  €  Qm ,  we  can 
proceed  with  our  contradiction. 

Consider  any  II  gm  II  *0  and  suppose  xk  e  Qm  for  all  k  >m  . 


(2a)  By  supposition,  II  gm  II  *0  and  xk  e  Q.m  for  all  k  >m  .  Suppose  further  that  lim 
sup  A*  <  8m .  Let  k  be  the  smallest  integer  for  which  k  >m  and  A*  <  5m  for  all  k  >k . 


108 


For  any  A*  <5m,  AS. 5.3  and  (C)  imply 

Ak  <  y3  max  {  Ak_{ ,  II pk-\^  }  ^  y3  (  1  +  a,  )  A*_,  (5.19) 

<  y3  (  1  +  a,  )  5m  <  8+  . 

Hence  (5.18)  implies  that  for  every  k  >k,  any  p/e  A(q,Ak)  computed  in  phase  2.2) 
satisfies 


P m)  >  r|2  •  (5-20) 

Thus,  if  q  €  lk ,  no  reduction  of  the  trust  radius  is  allowed.  Furthermore,  if  q  e  lk  and 
P*  (Pk  )^02 for  some  i  e  h> 


ek(pk) 

- —  =i  i -p kipit)  \  ^i-n2 

I predk(pj)  I 

and  hence  by  (5.14).  (5.15)  and  (5.19) 

ei(pi)>{\-r[1)\predi(pl)\  >y(l-r)2)c3  \\gm\\  Ak 

=  (Pi  +  (32  +  p3)(  l+at  )2  max{  (i,  1  }  5*  Ak 

>(P1  +  (32+P3)(  1+aj)2  max  { (i,  1  )  Ay 
so  that  (5. 12)  implies 


(5.21) 


(5.22) 


elk  (Pi  )  >  2 max  { |a,  1  }e^(pjc )  >  \i.e$(plk  )  .  (5.23) 

Therefore  phase  2.3)  cannot  be  executed,  and  hence  A^  >  Ak  for  every  k  >k. 

We  next  show  that  if  xk  e  Qm  for  all  k  >m,  then  Ak+i  ^  A*.  Consider  the 
computation  of  A*+1  as  specified  by  AS. 5. 5.  Since  A*  <8m,  (5.20)  and  (5.23)  and  the 
definition  of  the  algorithm  imply  x>  1.  Therefore 
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Afc  +  i  —  t  &k  —  ^k 


(5.24) 


for  every  k  >  ic .  Let  5m  sAftl.  Since  \  >\  and  A*  >A*_t  for  all  k  >ic. 


lim  inf  At  >  5m  >  0 


(5.25) 


Hence,  either 


8„,  >  lim  sup  At  >  lim  inf  A*  >  5m  >  0 


SUP  At 

^  oo  k  — *  oo 


(5.26) 


or 


lim  sup  A*  >  8m  >  0. 


(5.27) 


In  either  event,  lim  supA*  >  0. 
k 


(2b)  By  supposition.  Il*„ll  *0  and  xt  €  Cl„  for  all  k  >m  .  Conditions  (5.9)  and 
(5.10)  imply 

fk  ~fk+ 1  =aredk(pt)  (5’28a) 

>  r|t  predk(pk)  (5.28b) 

£  ytli  c3  IlgJI  min|  A*  ,  2?  }  '  (5'28c) 

By  the  standard  assumptions,  /  is  bounded  below  so  that  the  descent  condition 
implies  lim  (/*-/*+i)  =  0.  Therefore  lim  A*  =  0,  which  contradicts  (2a). 

Hence,  for  every  m  with  II  gm  I!  *0,  there  exists  m>m  for  which  x-  dQm. 


(3)  Part  (1)  of  this  proof  showed  that  {xk }  remains  inside  L{f  pc  t),  and  part  (2) 
showed  that  for  every  m  with  II  gm  II  *0,  there  exists  m>m  for  which 
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Hence,  eventually  some  X-  must  satisfy 


-xm  II  > 


2Pi 


(5.29) 


Therefore,  Theorem  4.4  implies  that  {gk  }  converges  to  zero.  Q 


5.9.  Other  Allowable  Options 

It  is  instructive  to  consider  several  slight  variations  to  the  previous  algorithms.  Two 
which  will  prove  useful  are  as  follows. 


Algorithm  Specification  AS.5.6  :  Permissible  Modification  to  AS.5.3  and 

AS. 5.5 

Let  80>0  be  given. 

(a)  Regardless  of  any  of  the  previously  specified  logic,  we  are  allowed  to  reduce 
the  trust  radius  via  step  2.3)  or  3)  at  iteration  k  whenever  A*>50.  If 
Ak  <  50 ,  the  previously  specified  logic  must  still  be  applied. 

(b)  Regardless  of  any  of  the  previously  specified  logic,  we  are  allowed  to  reduce 
the  trust  radius  via  step  2.3)  or  3)  at  iteration  k  whenever  Ak  >50H<?/H  for 
some  /  <k  ,  gt  *0.  If  Ak  <50llg/ll  .  the  previously  specified  logic  must 
still  be  applied. 


Theorem  5.2  :  Let  x  x  €  IR"  and  f  satisfy  the  standard  assumptions.  Suppose  that  an 
algorithm  implemented  consistently  with  A.4.3,  AS.5.3,  AS.5.5,  and 

modification  AS.5.6  is  applied  to  f  starting  from  xx,  and  let  conditions  (A), 
(B),  and  (C)  hold.  Then  the  algorithm  generates  a  sequence  { xk }  which  has 
the  descent  property.  Furthermore,  gk  -»  0. 


Ill 


Proof  :  The  proof  of  Theorem  5.1  need  only  be  modified  slightly  to  obtain 
these  results.  AS.5.6  differs  from  AS. 5.3  and  AS. 5.5  only  in  the  trust  radius 
modification  logic,  and  hence  only  section  (2)  of  the  proof  is  affected. 

To  establish  the  result  given  modification  (a),  we  need  merely  define  5m  by 

_5o _ 1-3 II  gm  11 _ u-n2) 

'  1  +  (T  j )  273(1+0!  )3  ((31  +  P2  +  (33)  max{ft,l} 

rather  than  by  (5.13).  With  (5.13)  replaced  by  (5.30),  the  hypotheses  in  section  (2a) 
of  the  proof  imply  that  Ak  <80.  Hence,  throughout  section  (2a),  the  trust  radius 
modification  logic  of  AS.5.3  and  AS. 5.5  must  be  used. 

To  establish  the  result  given  modification  (b),  define 

5^= ijS11**"  ‘  (5.31) 

gk*0 

By  hypothesis,  for  any  k>m,  xke  ,  so  for  any  /  < k ,  either 

(i)  l  <m  and  II  II  >5*,  or 

(ii)  /  >m  and  II  g,  II  > y  II  gm  II  >5£,  so  that  50  II  gt  II  >505^  >0.  We  then  replace 
(5.13)  by 

5m  =  min 


5n5fi 


C3  11  8n 


( i-n2) 


Y3  ( 1  +  o  i )  2y3(  1+aj)3  (Pi  +  P2  +  P3)  max{p,l} 


.  (5.32) 
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With  (5.13)  replaced  by  (5.32),  the  hypotheses  in  section  (2a)  of  the  proof  imply  that 
<5n  II  II .  Hence,  throughout  section  (2a),  the  trust  radius  modification  logic  of 
AS. 5. 3  and  AS. 5. 5  must  be  used.  Q 

AS. 5. 6  is  useful  in  the  case  where  evaluating  e[(plk)  for  the  model  with  index  j 
is  expensive.  For  example,  in  deciding  between  continuing  with  a  secant  model  or 
resetting  the  model  Hessian  to  the  exact  Hessian,  performing  an  e  -test  for  a  trial  step 
p  involves  either  calculating  the  exact  Hessian  immediately  or  approximating  the 
second  directional  derivative  of  /  in  the  direction  p.  Option  (i)  in  AS. 5. 6  allows  us 
to  specify  that  the  e-test  is  not  to  be  done  if  the  trust  radius  is  greater  than  a 
specified  minimum.  Furthermore,  consider  any  iteration  /  at  which  a  restart  is 
performed.  Examination  of  (SA3),  Lemma  4.2  and  the  inequality  (4.42)  applied  to 
Bk  =  Hk  shows  that  if  an  internal  doubling  procedure  is  executed  after  each  reset,  the 
step  selected  must  have  norm  greater  than  5  II  g/ II  for  some  8>0  independent  of  k. 
Hence,  option  (ii)  in  AS. 5.6  shows  that  an  acceptable  method  for  deciding  when  a 
reset  is  needed  is  to  save  the  norm  of  the  step  taken  at  the  last  reset  and  execute  a 
reset  when  the  current  trust  radius  drops  beneath  a  fixed  fraction  of  this  norm. 
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Algorithm  Specification  AS.5.7  :  Permissible  modification  to  AS.5.3  and 
AS.5.5 

In  addition  to  the  logic  previously  specified  for  step  2.2)  of  iteration  k ,  we  are  al¬ 
lowed  to  accept  Ak  =  Ak  and  pk<(\+d[)Ak  if  there  exists  a  model  with  in¬ 
dex  i  and  step  pk  €  A (i , Ak)  for  which  and  aredk(pk)>aredk(plk). 


Theorem  5.3  :  Let  .t  j  €  lRn  and  f  satisfy  the  standard  assumptions.  Suppose  that  an 
algorithm  implemented  consistently  with  A.4.3,  AS5.3,  AS.55,  and 

modification  AS.5.7  is  applied  to  f  starting  from  x  and  let  conditions  (A), 
(B),  and  (C)  hold.  Then  the  algorithm  generates  a  sequence  [xk }  which  has 
the  descent  property.  Furthermore,  gk  -»0. 

Proof  :  To  establish  the  result  for  modification  (b),  we  need  only  change 
(5.28a)  to 

fk-fk+i  =  aredk(Pk)  -  aredk{plk)  .  (5.33) 

The  remainder  of  the  proof  is  unchanged.  Q 

AS.5.7  is  motivated  by  the  switching  logic  in  NL2SOL.  Let  us  denote  the 
currently  preferred  model  by  “c”  and  the  alternative  by  “a.”  The  switching  logic  of 
NL2SOL  is  admissible  under  AS.5.3,  AS.5.5  and  AS.5.7  except  in  one  instance,  when 
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we  have 


P*(P/t)<Th- and  e^Pk) > P ^k(Pk)  (5.34) 

and 

P*(P*)-Tl2’  and  aredic(p£)> aredk(pk).  (5.35) 

Now,  since  NL2SOL  selects  its  new  model  based  on  the  largest  actual  reduction  in 
this  case,  the  current  model  predk  is  retained.  Furthermore,  NL2SOL  bases  its 
decision  about  trust  radius  reduction  on  pk(pk)  even  though  pkiPk)^\i  (which  would 
prohibit  trust  radius  reduction  in  our  framework). 

If  t|2>  PkiPk)  t^ien  die  l°gic  °f  NL2SOL  will  reduce  the  trial  radius  before 
the  next  iteration,  but  will  also  try  the  alternate  model  first  at  the  next  iteration  since 
ek(pk) > P ek  ( Pk  )•  ^  can  be  shown  that  this  switch  and  the  “internal  doubling”  logic 
of  NL2SOL  are  sufficient  to  negate  the  effects  of  the  trust  radius  reduction  when  Ak 
is  sufficiently  small.  However,  if  p£(p*)<rh,  an  internal  reduction  cycle  is  executed 
which  terminates  with  no  assurance  of  a  model  switch.  We  therefore  prefer  the 
following. 

For  the  instance  where  T)2  >  P*  (P*  )  —  “H  i»  we  consider  the  decision  not  to  reduce 
the  trust  radius  to  be  the  superior  strategy  (particularly  if  the  alternative  model  is  to 
be  tried  first  next  step  anyway).  The  decision  to  select  pk  is  also  superior  and  is 
allowed  by  AS. 5. 3  and,  AS. 5. 5. 

If  pk  (pk ) <  ri x ,  AS. 5. 7  allows  pk  to  be  selected  also  as  long  as  no  reduction  is 
subsequently  done.  We  support  our  arguments  against  decreasing  the  trust  radius  by 
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pointing  out  that  model  •‘a”  did  a  significantly  better  job  of  prediction  than  model 
”c"  by  two  different  criteria,  while  model  “c”  picked  a  good  direction  but  predicted 
the  reduction  poorly.  The  possibility  that  model  c  selected  the  better  step  by 
serendipity  is  strong  enough  to  support  not  reducing  A* . 

The  following  algorithm  is  an  example  of  a  two  model  system  with  no  a  priori 
preference  between  models.  It  is  quite  similar  to  NL2SOL,  but  with  the  changes 
noted  above.  The  index  “c”  again  denotes  the  currently  preferred  model,  while  “a” 
denotes  the  alternative. 
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Algorithm  A. 5.8  :  Nonpreferential  Model  Switching 


(J)  Let  p>0,  0<ri1<ri2^Tl3<l>  0 < <  1<y2<Y3>  At>0  and.Vje  IR"  be  given. 
Set  t  =  l  and  select  a  trial  model  c . 


1 )  Repeat  until  convergence  : 

2)  Compute  pk  e  A  (c  A k ). 


3)  If  pLk(Pk)<r\2' lhen 

If  e£(pt)  <  then 

Compute  pk<=  A  (a  Ak ). 

If  max  {  pk(pk) ,  pkiPk)  1  <Tli  then  go  to  5). 
If  p^Pk)  <  ’ll then  set  ts  ( 0 ,  Yt  ]- 
If  aredk{pk)>aredk(pk )<  then 
Set  xk+l  =xk  +pck. 

Go  to  7). 

Else 

Set  xk+l=xk+Pk- 
Switch  a  and  c . 

Go  to  7). 

End  if. 


Else 

If  p* (/?*)  < T1 1  then  go  to  5). 
Set  xk+l=xk  +p£. 

Set  xe  ( 0 , Yt  3- 
Go  to  7). 

End  if. 

End  if. 


4)  Set  xk+l  =xk  +pk. 

If  Pk(pk)  —  TI3.  then  set  te  [y2,Y33- 
Go  to  7). 

5)  Do  internal  reduction  via  Procedure  A.5.4. 


6)  Set  xk+l  =xk  +pck. 


7)  Set  Ak+[  =x  A^. 


8)  Set  t=  1,  update  the  models,  and 

select  a  trial  model  for  initial  use  in  the  next  iteration. 
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9)  End. 

If  both  models  are  quadratic.  Procedure  A.5.4  in  step  5)  can  be  replaced  by  A.4.5. 
Corollary  5.4  :  Given  the  assumptions  of  Theorem  5.3,  Algorithm  A.5.8  is  FOSPC. 


Proof  :  Follows  immediately  from  Theorem  5.3,  Lemma  4.2,  and  examination 
of  the  algorithm.  Q 

Another  possible  option  for  these  algorithms  is  to  include  logic  which  explicitly 
uses  knowledge  about  which  models  are  quadratic  with  uniformly  bounded  curvature, 
and  which  models  are  only  known  to  satisfy  the  UPD  condition.  For  example,  we  can 
show  that  it  is  admissible  to  decrease  the  trust  radius  whenever  pg(pjD<T\2,  or  if 
pl(pl)<r\2  and  |4 ei(pi)>ei(pt>'  where  the  model  with  index  q  is  known  to  be 
quadratic  with  uniformly  bounded  curvature.  However,  the  only  use  these  changes 
might  have  would  be  to  increase  the  efficiency  of  a  system  with  several  models  by 
allowing  A*  reduction  without  having  to  do  as  much  testing  between  models.  Since 
this  task  can  be  done  better  by  simply  increasing  the  value  of  [i,  we  find  nothing  to 
recommend  such  preferential  logic. 

5.10.  Summary  :  Type  Two  Algorithms 

Definition  5.2  :  The  general  Algorithm  A.4.3,  when  applied  consistently  with  the 
specifications  in  AS. 5. 3,  AS. 5. 5,  AS. 5. 6,  and  AS. 5. 7,  is  called  a  type  two 
algorithm. 
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This  class  of  algorithms  admits  a  very  broad  range  of  implementations  of  model 
switching  and  trust  region  logic.  In  particular,  the  major  features  of  NL2SOL 
switching  logic  are  admissible.  There  is  no  restriction  on  the  number  of  models 
allowed,  or  on  the  method  of  choosing  an  initial  trial  model,  and  the  algorithm  is  not 
required  to  use  a  priori  information  specifying  which  model  has  uniformly  bounded 
curvature.  There  is  no  requirement  forcing  continuation  of  the  switching  logic  tests 
once  a  trust  radius  reduction  phase  has  started,  and  no  requirement  on  how  the  trust 
radius  is  decreased  as  long  as  such  decrease  is  allowed  and  as  long  as  a  satisfactory 
step  is  eventually  found. 

Furthermore,  the  class  of  algorithms  based  on  Reduction_Test_l  can  be 
implemented  with  high  efficiency,  even  for  a  large  number  of  models.  These 
algorithms  can  be  written  to  bias  toward  the  use  of  a  particular  model  if  it  is  known 
to  be  inexpensive  or  suspected  of  being  faster  than  the  alternative,  or  they  can  treat 
the  models  solely  on  the  basis  of  their  performance  as  in  NL2SOL. 

Finally,  only  two  requirements  are  imposed  on  the  models.  At  least  one  model 
must  be  a  quadratic  in  standard  form  with  uniformly  bounded  curvature,  and  each  of 
the  models  and  associated  procedure  for  computing  a  trial  step  must  satisfy  the  UPD 


condition. 


CHAPTER  6 


Safe  Algorithms  using  Minimal  Restrictions  on  the  Models 

6.1.  Introduction 

In  this  chapter,  we  only  assume  that  one  model  is  a  standard  quadratic  with  uniformly 
bounded  curvature.  It  is  implicit  in  the  definition  of  the  previous  algorithms  that  the 
number  of  models  need  not  be  constant  from  iteration  to  iteration.  Thus  any  type  two 
algorithm  that  deletes  all  models  not  known  to  satisfy  the  uniform  predicted  decrease 
condition  is  still  first  order  stationary  point  convergent.  Alternately,  if  all  such 
models  are  modified  so  that  the  condition  is  assured,  then  any  type  two  algorithm 
using  these  models  is  first  order  stationary  point  convergent.  A  type  two  algorithm 
that  deletes  or  modifies  models  not  known  to  satisfy  the  uniform  predicted  decrease 
condition  is  said  to  be  a  type  three  algorithm. 


6.2.  Enforcing  the  UPD  condition  by  Direct  Methods 


One  technique  for  ensuring  that  (5.1)  holds  for  each  p[  considered  is  to  directly 
enforce  such  a  condition.  The  expression 


minredk  (A *)  =  c0  ^  8k  ^  j  At  »  — jr 
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can  be  used  to  decide  whether  to  delete  a  model  and  its  associated  trial  step  from 
consideration  at  the  current  iteration.  The  parameter  c0  is  set  to  a  value  in  (0,0.5). 
The  variable  qk  >0  is  an  overestimate  of  the  maximum  curvature  of/.  It  can  be  set 
to  a  constant  value,  such  as  106.  We  then  define 


UPD_Testk  (Ak )  ={ 


false  if  Predk  ( Pk)<  minredk  (A* ) 
true  if  predk  ( p* )  > minredk ( A* ) 


(6.2) 


6.2.1.  A  Globally  Convergent  Modification  of  NL2SOL 


The  following  algorithm  is  a  modification  of  Algorithm  A.5.8  that  is  globally 
convergent  under  even  milder  assumptions  on  the  models.  Specifically,  one  of  the 
two  models  used  is  assumed  to  be  a  quadratic  in  standard  form,  but  no  assumptions 
whatsoever  are  made  concerning  the  alternate  model.  The  model  switching  scheme  is 
a  very  slight  variation  of  that  used  by  NL2SOL.  The  algorithm  itself  can  be  applied 
to  NLS  or  general  unconstrained  optimization.  For  simplicity  we  have  left  out  the 
internal  doubling  option.  As  in  Algorithm  A.5.8,  the  current  model  is  denoted  by  the 
index  c  and  the  alternate  is  denoted  by  the  index  a .  One  of  the  models,  denoted  by 
the  index  q ,  is  a  quadratic  in  standard  form. 


Algorithm  A.6.1  :  A  Globally  Convergent  Model  Switching  Algorithm 

0)  Let  jx > 0,  c0e  (0,0.5),  0<qi<q2<Tl3<l,  (Ky^  1  <y2<y3,  Jc^l  1.-).  ^i>°  and 
x  t  e  IR"  all  be  given. 


Select  a  trial  model  with  index  c . 


1 )  Repeat  until  convergence  : 

2)  Compute  an  overestimate  \k  to  the  curvature  of  the  problem. 

3)  Compute  pk  A  {c  ,At ). 

4)  If  c*q  and  UPD_Testk(Ak)  =  false,  then 

Switch  model  preference  by  switching  the  indices  a  and  c. 
Compute  pk  €  A  (c  Ak ). 

If  pk(pk)<r\\  then  go  to  8). 

End  if. 

5)  If  pk(pk)>r\2,  then 

Set  xk+l  =xk  +p[. 

If  PkiPk)-^  then  set  xe  [Y:.Y?1- 
Go  to  9). 

End  if. 

6)  If  ek(pk)<  —ek(pk),  then 

M* 

Compute  pk€  A  (a  Ak ). 

If  a  and  U  PD  Jest*  { Ak)=  false,  then 
If  P*c(P*c)<Tli  then  go  to  8). 

Set  ts  (O.Yi  ]• 

Set  xk+i  =  xk  +pk. 

Go  to  9). 

End  if. 

Set  pk  =max  {p*c (pck  ),p*  (pk  )}• 

If  p^  <r),  then  go  to  8). 

If  p*  <p2  then  set  xe  (0,Yi  ]• 

If  aredk(pk)>aredk(pk),  then 
Set  xk+x=xk+pck. 

Go  to  9). 

Else, 

Set  xk+l-xk  +pk. 

Switch  a  and  c. 

Go  to  9). 

End  if. 

End  if. 

7)  If  Pi(pD  — tli,  then 
Set  xk+l=xk  4 -pck. 
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Set  xe  (O.Yil- 
Go  to  9). 

End  if. 

8)  For  ;'=1.2,...  do  : 

Set  A*  e  (O.Yi  A*  ]. 

Compute  pi  eA(c,At). 

[f  C*q  and  UPD_Testck{ A*)  =  false,  then 

Switch  model  preference  by  switching  the  indices  a  and  c . 

Compute  p[  e  A{c  ,At ). 

End  if. 

if  p )>m. then  set  -**+1=**  +p* and  g° t0  9)- 
Lf  c  and  mod(j ,jc)  =  0,  then 

Switch  model  preference  by  switching  the  indices  a  and  c. 

End  if. 

End  for. 

9)  If  x <  L  then  set  A*+1=xll  11 , 

Else  set  A*+1=tA*. 

10)  Set  x  =  1,  update  the  models,  and  select  a  trial  model  for  the  next  iteration. 

11)  End. 


Several  features  of  this  algorithm  deserve  comment.  First,  the  parameter  jc  is 
the  integer  introduced  in  Procedure  A.5.4,  and  the  function  modtj ,jc)  is  the  standard 
“modulus”  operator  that  has  value  0  if  and  only  if  j  =jc.  The  portions  of  the 
algorithm  involving  jc  are  included  to  guarantee  successful  termination  of  the  inner 
reduction  loop  8)  for  general  models  and  are  are  not  essential  if  both  models  are 
quadratics.  A  suggested  value  for  jc  is  5. 

Second,  this  algofithm  is  much  less  complicated  than  superficial  appearances 
might  indicate.  Denote  the  alternative  to  model  q  with  the  index  d.  A.6. 1  differs 
from  A.5.8  only  in  that  it  deletes  model  d  and  step  pjf  if  predfc (pjj)  is  not  as  large  as 


expected. 


Third,  no  restrictions  are  placed  on  how  the  new  trial  model  is  selected  at  the 
end  of  each  iteration.  This  decision  could  be  made  by  any  of  the  tests  for  model 
selection  discussed  in  Chapter  3.  Another  possibility  is  to  continue  to  use  the  model 
which  was  successful  at  the  current  iteration.  Yet  another  would  be  to  initially  try 
the  same  model  at  every  iteration. 

Using  the  theory  presented  in  Chapter  5,  we  can  establish  conditions  under 
which  this  algorithm  is  globally  convergent.  Consider  the  following  specifications. 

(A)  One  of  the  models,  denoted  by  the  index  q,  is  a  quadratic  in  standard  form  for 

which  lim  sup  II  B{!  II  <°«  . 
k  — >  ~ 

(B)  If  pj!  <=  A  (q  ,Ak ),  then  pg  is  either  an  FCD  step  or  an  OLC  step. 

(C)  For  every  i ,  plke  A  (i  ,Ak )  implies  1 1  plk  1 1  <  ( 1  +  )  Ak  for  some  al  e  [0, 1 ). 

We  then  have  the  following  result. 

Theorem  6.1  :  For  a  given  xxe  IR",  assume  that  f  satisfies  the  standard 
assumptions  and  that  (A),  (B),  and  (C)  hold.  If  A.6.1  is  applied  to  f  starting 
from  xx  and  [%k  }  is  bounded ,  then  the  algorithm  generates  a  sequence  { xk  } 
which  is  FOSPC. 

Proof  :  Any  step  pk  accepted  by  the  algorithm  satisfies  aredk(pk)> 0,  so  {xk  ) 
remains  within  L(f  qcf.  Inspection  of  A.6.1  shows  the  existence  of  pk  unless  step  8) 
is  unable  to  find  a  step  satisfying  pkeA(i,Ak)  and  p*(p*)^Tli-  ®ut  a^ter  Jc<00 


124 


executions  of  this  loop,  the  model  preference  is  switched  to  q .  Condition  (A)  states 
that  this  model  is  a  quadratic  in  standard  form;  and  Lemma  4.2  establishes  the 
existence  at  each  iteration  of  a  positive  scalar  5  for  which  p§ e  A(q  ,Ak)  and  A*  <5 
implies  p g(pg)>r\i.  Since  yt<  1,  the  algorithm  must  eventually  either  accept  a  step 
with  A*  >5  or  set  A*  <8.  Hence  A.6.1  successfully  generates  a  step  pk  at  every 
iteration. 

Next,  conditions  (A),  (B),  and  Lemma  4.1  imply  that  the  model  with  index  q 
generates  steps  p%  satisfying  the  UPD  condition.  If  the  model  with  index  d  generates 
a  step  not  satisfying  predk(pk)  >  minredk  (Ak ),  it  is  deleted  from  consideration.  Since 
}  is  bounded,  each  step  pk  considered  satisfies  the  UPD  condition.  Therefore  this 
algorithm  is  an  admissible  variation  of  AS. 5. 7  and  thus  is  FOSPC  by  Theorem  5.3. 

□ 

6.2.2.  Some  Remarks  on  Direct  Methods 

Direct  tests,  such  as  used  in  the  last  example,  are  conceptually  different  from  other 
tests  we  have  presented.  Notice  that  A.6. 1  deletes  models  and  associated  steps  before 
evaluating  the  actual  function  reduction.  Direct  tests  should  therefore  be  thought  of 
as  a  way  of  limiting  which  models  to  include  in  lk  at  a  given  iteration  and  should  not 
be  considered  part  of  the  model  switching  logic  per  se. 

Moreover,  great  c^re  should  be  taken  in  implementing  these  tests  as  they  can  be 
scale  dependent.  Suppose,  for  example,  that  a  NLS  algorithm  using  a  secant  model 
and  a  Gauss-Newton  model  sets  c0=0.5  and  defines  the  sequence  {^k  }  by 
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%k  =  11  J(xk)lJ(xk)\\ .  If  S(jc)  is  positive  definite,  II  H(x)  II  may  typically  be  much 
larger  than  II  J  (x )‘ J  (x )  II ,  so  the  secant  model  may  be  deleted  even  if  it  does  a 
much  better  job  of  predicting  the  actual  function  reduction  than  the  Gauss-Newton 
model.  Even  more  striking  is  the  case  of  using  the  identity  matrix  for  Bg  and  using  a 
secant  method  to  generate  the  other  model  Hessian.  A  naive  implementation  might  set 
c0  =  0.5  and  ^  =  11  /  II  =1.  Again,  the  secant  model  may  be  frequently  rejected  in 
favor  of  the  model  using  Bj-l  even  if  it  does  a  much  better  job  of  predicting  the 
actual  function  reduction. 

To  avoid  spuriously  deleting  valid  models,  c0  should  be  set  to  a  very  small 
value.  We  suggest  c0  =  0.001.  As  previously  stated,  the  sequence  {£*}  should  be  an 
overestimate  to  the  curvature  of  the  problem. 

On  the  other  hand,  direct  tests  have  several  advantages.  They  are  simple  and 
inexpensive  to  implement.  They  can  be  applied  to  any  models  whether  quadratic  or 
not.  Also,  use  of  minredk  to  delete  models  is  much  better  than,  say,  deleting  models 
which  violate  1 1  Bk  1 1  >£,*. 

6.2.3.  Estimating  an  Upper  Bound  for  the  Curvature  of  a  Function 

Direct  tests  are  less  sensitive  to  the  scale  of  the  problem  if  the  curvature  estimates 
used  to  define  t,k  are  determined  from  the  behavior  of  the  function,  rather  than  from 
the  curvature  of  one  oh  the  models.  Such  estimates  can  be  produced  as  follows. 

Consider  a  secant  method  for  “updating”  a  model  Hessian  Bk.  Such  a  method 
generates  a  new  model  Hessian  Bk+l  satisfying  the  secant  equation 
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Bk+iPk=yk  (6.3) 

1 

for  some  yke\Rn.  Define  Hk+l=jH(xk+Xpk)dX  and  recall  that  Lemma  2.4 

o 

established 

l 

fH{xk+\pk)pkd\=gk+l-gk  .  (6.4) 

o 

Thus,  if  yk  is  defined  to  be  gk+l  -gk,  a  matrix  Bk+{  satisfying  the  secant  equation  has 
roughly  the  same  curvature  as  Hk+l  in  the  direction  pk.  For  nonzero  pk,  this  suggests 
the  curvature  estimate 


^+i=  max-! 


£*-i  • 


Pk  Bk+lpk 


PkPk 


=  maxs 


t  1  pk  yk 

s*-i » — : — 
PkPk 


This  estimator  is  bounded  if  there  exist  P4>0  such  that 


plkyk  1  ^  P4  II  pk 


(6.5) 


(6.6) 


For  the  standard  definition  of  yk ,  this  is  easily  established. 

Lemma  6.2  :  For  a  given  x  { e  1R" ,  assume  that  f  satisfies  the  standard  assumptions. 
For  any  xk+l  and  xk  in  L  if  jcy),  let  yk  be  defined  as 


yk  -£*+i  ~8k 


(6.7) 


Then 
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I  y[pk  I  <(3i  Mpyt  II 2  •  (6-8) 

Proof  :  From  the  Cauchy-Schwartz  inequality,  I  vkpk  I  £  Up*  II  11  vj! .  Lemma 
2.5  and  (SA3)  imply  II  gk+l~%k  11  "P*11  -  which  leads  immediately  to  the  result. 

□ 

Other  formula  are  sometimes  used  to  define  yk .  Alternate  definitions  of  yk  have 
been  created1  for  NLS  problems  in  order  to  take  advantage  of  the  problem  structure. 
For  most  of  these  alternate  definitions,  mild  conditions  on  r  are  sufficient  to  establish 

—  is  bounded.  For  example,  DGW  [1981]  make  the  definition 

II  pk  II2 

yk  ~^k+[rk+\  —  Jlkrli+\  +^k+l^k+\Pk  ■  (6.9) 

Consider  the  following  conditions. 

(j)  The  sequence  { xk  }  satisfies  the  descent  property  with  respect  to 
fix)  =  y  llr(.t)ll22. 

(ii)  There  exists  (3y  <  °°  such  that  for  any  x  €  L  if  pc  t),  1 1  7  (r )  1 1  <  (3y . 

(iii)  J(x)  is  Lipschitz  continuous  over  L(f  so  that  for  every  x  ,y  s  L(f  pc  i) 
we  have  II  J(x)-J(y )  II  ^ (3/;p  W  x  -y  II . 

Lemma  6.3  :  Let  yk  be  defined,  by  (6.9)  and  suppose  that  (i),  (ii)  and  (iii)  above  are 
true.  Then  there  exists  (34<=°  such  that 

1  See,  for  example.  DGW  [1981],  Dennis,  Sheng,  Vu  [1985],  Betts  [1976],  Dennis  [1973],  and 
Al-Baali,  Fletcher  [1985]. 
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y'kPk  1  —  $4  ^  Pk 


Proof  :  First,  we  have 


ykPk  -Pk  (Jkl{  ~Jlk  )rk+[+PkJk+\Jk+iPk 
=  U  Jk+{pkU2+Pk  J(Xk+Pk)‘ 


'*+ 1 


From  the  Cauchy- Schwartz  inequality 

I  pkyk  I  <  II  Jk  +  [  II  ‘H  Pk  II  ~  +  ^  Pk  ^  ^  J  ( Xk+Pk)‘  ~J  (xk)‘  ]  rk+ 1 
so  that  applying  (i),  (ii)  and  (iii)  gives 


plyk  1  £ 

< 


Py  2  +  P/i>  "  r*+i 


Py  2+Pt<p  ^  2/ (x*) 

Py  2  +  hP  V27W 


lPk 


Pk 


Pk 


(6.10) 


(6.11) 


(6.12) 


Therefore  defining  P4  =  Py  2  +  (3,,p  V  2  f(x{)  completes  the  proof.  □ 
We  make  use  of  the  following  procedure  to  compute  \k . 
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Procedure  A.6.2  :  Computing  a  bounded  overestimate  for  the  curvature 
of  a  function 

If  k  =  1  then  set  ^  =  106. 

If  k  >  1  and  II  pk_y  II  *0,  then 


Set 


^  =  max  {  ^_1?  100  x 


yk-iPk-i 

Pk-iPk-i 


}• 


End  if. 


Implementation  of  this  procedure  under  the  assumptions  of  Lemma  6.2  or 
Lemma  6.3  generates  a  bounded  sequence  {^k }  which  can  be  used  in  A.6.1  or  other 
algorithms  using  direct  tests  to  enforce  UPD.  An  alternative  is  to  use  a  fixed 
curvature  estimator,  such  as  ^  =  106. 

6.3.  The  UPD  Condition  and  Quadratic  Models 

In  Chapter  4,  we  considered  quadratics  in  standard  form  and  assumed  that  steps  were 
computed  using  FCD  or  OLC  procedures.  Lemma  4.1  established  that 

lim  sup  II  B*  II  <~  (6.13) 

k  — 

is  sufficient  to  imply  the  UPD  condition  for  the  model  with  index  i .  However,  much 
weaker  assumptions  can  be  used  to  establish  the  UPD  condition.  These  weaker 
conditions  differ  not  only  for  different  categories  of  models  but  also  for  different  step 


generation  procedures. 
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6.3.1.  Sufficient  Conditions  for  UPD  with  Quadratic  Models 

Before  proving  our  theorem,  we  review  the  following  results  concerning 
quadratics  in  one  variable  and  their  extrema.  Consider  a  nonzero  vector  w  e  I Rn .  For 
a  quadratic  model  in  standard  form  at  iteration  k  with  model  Hessian  Bk  we  have 

predlk  (aw)  =  -ag‘k  w  - y  a2w'  Blk  w  .  (6.14) 

Now,  if  w‘  Bk  w  >0,  we  have  that 

maximize predk  (aw)  (6. 1 5) 

a 

*  8kw 

is  solved  by  a  = - : — ,  with 

w‘  Blk  w 

predk{  a*w)  =  |  (6.16) 

^  w  Bkw 

Furthermore,  for  a  €  [0 ,  a  ] 

cx 

-  —  gl  w  <predk  (aw  )<-ag‘kw  .  (6.17) 

If  w‘ Bkw  <0,  we  have  that 

predk  (aw)>-agkw.  (6.18) 

for  all  a.  We  then  have  3  possible  cases  in  solving 

maximize predk  ( a  w )  .  (6.19) 

a 

sit  II  a  w  II  <A* 

(1)  If  w‘Bkw> 0  and  II  a*w  II  then  (6.19)  is  solved  by  a*  as  before,  with 


maximum 
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t  ,.,\2 


predk  (a*  w)  = 


1  (gkw) 


2  w‘Bkw 

(2)  If  w‘Bkw  >0  and  II  a*  w  II  >Ak,  then  (6.19)  is  solved  by 


a**  =  -sign  (g‘kw)  Ak  /  II  w 


with  maximum 


predk  (a  w)  =  sign  ( g[  w)-^—fi~Ak 
Furthermore 


1  2  w  Bjw 

2  k  II  w  II  2 


1  A  1  g‘k w  1  ^  ..i,  **  ,  ^  4  1  gk  w  I 

TA*iunr spredt(a  w)£A*-iMT' 

(3)  If  w‘ Bkw  < 0,  then  (6.19)  may  have  either  one  or  two  local 
corresponding  to  a***  =±  ..  *.  If  gkw  =0,  two  maxima  exist  with 

1 1  W  1 1 


pred‘k  (a***  w)  =  -y 


w 


w  Blkw  . 


If  not,  the  global  maximizer  is  unique  so  that 


***  .  /  /  \ 
a  =-sign(gkw ) 


II  w  II 


and 


pred'k(a**  w)=\g‘kw\ 


w 


w‘  Bkw 


In  either  event,  the  maximum  is  given  by  (6.26).  Furthermore, 


I  gl  w  I  *** 


(6.20) 

(6.21) 

(6.22) 

(6.23) 
solutions 

(6.24) 

(6.25) 

(6.26) 


(6.27) 
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The  following  theorem  applies  to  quadratic  models  that  use  FCD  or  OLC 
procedures  to  generate  trial  steps. 

Theorem  6.4  :  Consider  a  quadratic  model  in  standard  form  with  model  Hessian  Bk 
and  an  associated  step  p‘k . 


(a)  If  pk  is  an  FCD  step  with  constant  Cj>0  and 


8k  Bi  gk  <  P3  8k8k 


(6.28) 


for  some  (33>0,  then 


1  llfo 

predj,  (pk)  >  —  C\  II  gk  II  mnvj  A*  — 


(6.29) 


(h)  Let  pk  be  an  OLC  step  with  constant  c  2  >  0  and  let  wk  e  1R"  be  a  nonzero 
vector  satisfying 


wk  Bk  wk  <p4  w‘kwk 


and 


(6.30) 


I  glkwk  I  >  e4  IlgJI  II  wk 


(6.31) 


for  some  P4>0,  e4>0.  Then 


1  )  e4 1 1  gk 

predlk  (pj)>  —  c2e4  11  8k  11  min“j  Ak  . - 


(6.32) 


Proof 


(a)  Let  pk  be  an  FCD  step  so  that 
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predj.  (p’l)>cx  maxpredlk  (a  gk)  . 
a 

sit  II  a  £*11  <A* 


(6.33) 


Now  let  w  be  replaced  by  gk  in  (6.19),  (6.20),  (6.23)  and  (6.27)  so  that  (6.28) 
implies 


predk  ( pjc  )  —  f !  min 


1  1  SkSk 

2  giBlgk  '1  >  H&ll 


(6.34) 


>-cl\\gkW  min<| 


At , 


IlfoH 

P3 


which  establishes  the  part  (a)  of  the  theorem. 


(b)  Let  plk  satisfy  OLC  condition  so  that 


predk  ( pi)>c2  max  predlk  ip ) 
p 

sit  lip  II  <  A* 

> c2 max predk  (a wk). 

a 


(6.35) 


sit  II  a wk  II  <  A* 


From  (6. 19), (6.20),  (6.23),  (6.27),  (6.30)  and  (6.31)  we  obtain 
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predlk{plk)>c2  mirW 


1  1  8k™k  >2  1  I  8k™k  |2 

2  w‘kB^k  2  M  wk  II 


>y  c2min^ 


£4  II  8k  II 2  II  wk  M  2  .  e4ll5*11  H^11 
“At  “ 


w[Blkwk 


\\wk\\ 


c2e4 


S*11  minK 


Ak  , 


£4llgJI  IlMfrll 
w[Blkwk 


(6.36) 


11 8k  11  minj  IlgJI 

which  establishes  the  part  (b)  of  the  theorem.  Q 
6.3.2.  Directionally  Bounded  Curvature 

Consider  a  sequence  of  nonzero  vectors  { vk  }  in  IR" .  A  sequence  of  quadratic 
models  in  standard  form  is  said  to  be  directionally  bounded  in  curvature  with  respect 
to  { v* }  if  there  exists  P3  >  0  such  that 


viBivk  ^  h  vlvk 


(6.37) 


for  all  k .  If  there  exist  e4  >  0  and  p4  >  0  such  that  for  any  k  there  exists  a  nonzero 
wk  €  IR"  satisfying 

w‘kBkwk  ^  P4  w‘kwk  (6-38) 

and 


I  vk  wk  I  >  e4  1 1  vk  1 1  1 1  1 1 


(6.39) 


then  the  sequence  of  models  is  said  to  be  quasi-directionally  bounded  in  curvature 
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with  respect  to  {v*  }. 

Corollary  6.5  :  Consider  a  sequence  of  quadratic  models  in  standard  form  with  an 
associated  procedure  for  computing  steps. 

(a)  If  the  procedure  computes  FCD  steps  and  the  sequence  of  models  is  of 
directionally  bounded  curvature  with  respect  to  {gk },  then  the  UPD  condition 
holds. 

(b)  If  the  procedure  computes  OLC  steps  and  the  sequence  of  models  is  of 
quasi-directionally  bounded  curvature  with  respect  to  {g* },  then  the  UPD 
condition  holds. 

Proof  :  To  establish  (a),  let  the  constant  c3  in  (5.1)  be  defined  as  c3  =  yc1. 
Then  (a)  follows  immediately  from  Theorem  6.4.  To  establish  (b),  let  the  constants 

1  P4 

in  (5.1)  be  defined  as  c3  =  — c2e 4  and  (33  =  — .  Then  (b)  follows  immediately  from 

2  £4 

Theorem  6.4.  Q 

6.3.3.  Relation  Between  Directionally  Bounded  Curvature  and  Uniformly 
Bounded  Curvature 

Consider  a  sequence  of  models  which  is  directionally  bounded  in  curvature  with 
respect  to  {gk  }  so  that 
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gkBigt  ^  P3  gigk  (6.40) 

for  some  p3.  We  immediately  see  that  (6.40)  is  weaker  than  (6.13);  and  hence 
Lemma  4.1  is  a  special  case  of  Theorem  6.4. 


6.3.4.  Quasi-directionally  Bounded  Curvature  and  Secant  Methods 

A  sequence  of  models  is  quasi-directionally  bounded  in  curvature  with  respect  to 
{g*  }  only  if  there  exist  e4  >  0  and  (34  >  0  for  which 


KBkwk  ^  p4  w‘kwk 


(6.41) 


and 


I  gjwk  I  >  e4  It gk\\  \\wk\\  (6.42) 

for  some  nonzero  wk  6  IR"  at  every  iteration.  We  immediately  see  that  (6.41)  and 
(6.42)  are  a  strict  relaxation  of  (6.40). 

Next,  consider  a  secant  method  producing  model  Hessians  which  satisfy 
Bk Pk-\=yk-i-  Substituting  pk_i  for  wk  in  (6.41)  and  (6.42)  gives  the  expressions 

Pk-iBkPk-\  ^  Pa  Pk-iPk-i  (6-43) 

and 


I  gkPk- 1  1  ^  e4  WPk- 111  • 


(6.44) 


As  noted  earlier,  most  definitions  of  yk  ensure  (under  mild  assumptions)  that 


Pk  yk 
pi  Pk 


is  bounded.  If  so,  there  exists  (34  <  <»  for  which 
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%pLiPk-i  *  pi-iyt-i 

=  Pk-iBkPk-i  ■ 


(6.45) 


I 


Pk  yk 


Hence  by  Theorem  6.4,  if  (a)  - is  bounded  by  (34,  (b)  B‘k  satisfies  the  secant 

PkPk 

equation  Bk  pk_x  =yk-\  and  (c)  pk  is  computed  by  an  OLC  procedure,  then  (6.44)  is  a 
sufficient  condition  to  imply  the  UPD  condition. 

6.3.5.  Alternatives  to  UPD  Test 


One  possible  use  of  Theorem  6.4  is  to  provide  alternatives  to  the  use  of  minredk  for 
enforcing  the  UPD  condition.  Rather  than  using  (6.2)  to  identify  questionable 
models,  we  can  define 


false  lf8kBk8k  >  UU*11 


UPD  Test _FCD[  =5 


(6.46) 


true  if  g‘kBjgk  <^k\\gk\\2  . 

Unlike  UPD_Test,  this  test  need  only  be  performed  at  the  start  of  an  iteration  rather 

than  every  time  a  step  is  computed.  Furthermore,  it  is  less  expensive  since  it  can  be 
applied  before  computation  of  a  step.  For  secant  methods  using  OLC  procedures  to 
compute  trial  steps,  we  can  also  use  the  test 


UPD Jest  OLCi  =i 


false 


if  8kBiSk  >  %k  11  8k  11  2 
and  I  gjpk_x  I  <  e0  II  gk  II  llp*_1ll 


(6.47) 


true  if  g‘kB[gk  <^k\\gk\\2 

or  I  8kPk- 1  1  ^  eolU* 11  UPk- 111 
for  some  e0  >  0.  A  suggested  value  is  e0  =  10-6. 
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6.3.6.  Modifying  Quadratics  to  be  Directionally  Bounded 

Theorem  6.4  shows  that  the  UPD  condition  is  satisfied  as  long  as  the  curvature  of  the 
standard  quadratic  model  is  bounded  in  the  gradient  direction.  This  suggests  an 
alternative  to  simply  deleting  models  not  satisfying  UPD_Testk  =  true.  These  models 
can  instead  be  modified  in  a  subspace  containing  gk  so  that  the  curvature  in  the 
gradient  direction  is  corrected. 

Several  procedures  are  available  to  force  the  curvature  in  a  given  direction  to 
take  on  a  specific  value.  One  of  the  simplest  is  as  follows.  Let 
U (B  ,p  ,y)  :JRnxn xIR"  xIRn  —>  IR" xn  represent  a  secant  update  satisfying 


U(B,p,y)p=y. 
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Procedure  A.6.3  :  Simple  Procedure  for  Correcting  a  Model  Hessian 

If  UPDJest_FCDj,  =  false,  then 
Set  P  =  ^  • 

Set  pE  =  gk. 

Setye  =  pg*. 

Compute  B  -U  ( Bk,pe,y£ ). 

Set  Bk  =B . 

End  if. 


This  procedure  only  modifies  B{  if  gkBkgk>^k  gkgk-  After  modification,  B[  will 
satisfy  Bjp£=y£  and  g‘kBkgk  =^k  g‘kgk.  For  a  typical  secant  update,  the  model  is 
only  changed  in  a  two  dimensional  subspace.  If  desired,  an  update  can  be  used 
which  also  forces  Bj  to  continue  to  satisfy  the  secant  equation  from  the  last  iteration: 


of 


BkPk-i=yk-i  •  (6-48) 

The  procedure  can  be  improved  by  setting  P  to  a  finite  difference  approximation 

glH(xk  )gk 


-.  One  such  approximation  is 


gkSk 


2  f(xk+egk)-f(xk)-e\\gk\\ 


II  ft  I 


(6.49) 


Techniques  for  determining  an  appropriate  value  for  e  can  be  found  in  Section  5.4  of 
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Dennis  and  Schnabel  [1983].  A  central  difference  approximation  could  also  be  used, 
but  would  require  an  extra  function  evaluation. 


Procedure  A.6.4  :  Another  Procedure  for  Correcting  a  Model  Hessian 

If  UPD JTest_FCDlk  =  false,  then 

2  f  (xk+egk)-f  (xk)-e\\ gk\\2 

Set  (3  =  — - - - -  for  an  appropriately  selected 

e2  N  S*H  2 

e. 

Set  p£  =gk. 

Set  ye  =  p&. 

Compute  B  =U  (Bk,pe,ye). 

Set  Bj,  =B . 

End  if. 


After  modification,  Blk  will  satisfy  Bkpe=ye  and  hence 
single  =  - j(f(xk+egk  )-/U*)-ell^H2)- 

Notice  that  £,k  is  only  used  in  determining  whether  modification  of  the  matrix  is 
deemed  necessary.  The  modification  itself  does  not  depend  on  any  arbitrary 
constants.  Hence,  even  if  %k  is  fixed  at  an  arbitrary  value,  Procedure  A.6.4  is 
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essentially  scale  invariant. 

Although  A. 6.4  is  a  large  improvement  to  AS.6.3,  it  still  has  one  undesirable 
feature.  Specifically,  the  algorithm  alway  modifies  B k  so  that  gk  is  an  eigenvector. 
The  following  procedure  does  not  share  this  feature. 


Procedure  A.6.5  :  The  Recommended  Procedure  for  Correcting  a  Model 
Hessian 

If  UPD_Test_FCDk  =  false,  then 
For  an  appropriately  selected  e, 

Set  pe  =  egk. 

Set  ye  =  g(xk+pE)-gk. 

Compute  B  =  U  (Bk,pz,y£). 

Set  Blk  =B . 

End  if. 


After  modification,  Bk  will  satisfy 

BlSk  =  j  (g(xk+egk)-gk)  .  (6.50) 

Lemma  6.6  :  Let  f  satisfy  the  standard  assumptions  and  let  xk  he  in  L(f  pc  {).  Let  e 
be  any  nonzero  scalar  satisfying  xk  +Egk  e  Lif  yc\)- 
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If  (6.50)  is  satisfied,  then  g[  B(  gk  <  II  gk  II  2. 


Proof  :  Application  of  Lemma  2.5  and  the  standard  assumptions  on  /  yield 


Bigk  11  -  “  11  S(xk+egk  )~gk  I' 

^  tPi  11  eSk  11  =  Pi  11  8k  11  • 


(6.51) 


By  the  Cauchy-Schwartz  inequality 


glBigk<\\gkW  II  Bjgk  11  11  S* 


(6.52) 


which  completes  the  proof.  Q 

If  OLC  steps  are  being  computed  for  a  given  secant  model,  UPD_Test_FCD  can 

I  pkyk  I 

be  replaced  by  UPD  Test  OLC  in  any  of  last  three  procedures  provided - is 

“  "  PkPk 


bounded. 


6.4.  Summary  :  Type  Three  Algorithms 

The  methods  of  this  chapter  make  it  possible  to  use  any  model  in  an  algorithm  and 
retain  FOSPC  under  the  standard  assumptions  on  the  function.  These  methods  are 
computationally  efficient  and  can  be  implemented  in  a  relatively  scale  independent 
way. 

Three  tests  are  presented  for  identification  of  questionable  models.  UPD  Test 
can  be  used  with  any  model,  while  UPDJTestJFCD  and  UPD_Test_OLC  apply  to 
quadratics  in  standard  form  with  associated  FCD  or  OLC  procedures  for  computing 
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trial  steps.  These  tests  are  performed  directly  on  the  models  and  thus  do  not  involve 
any  comparison  of  the  model  with  the  function. 

Once  a  questionable  model  is  identified,  it  can  either  be  deleted  from  the  set  of 
models  under  consideration  or  be  modified  to  satisfy  the  UPD  condition.  Deletion  is 
the  simplest  procedure,  but  should  be  avoided  if  possible.  For  quadratic  models,  we 
strongly  recommend  modification  by  Procedure  A.6.5. 


Chapters  5  and  6  provide  several  theoretically  justified  ways  of  deciding  when  to 
“refresh”  or  reset  a  secant  update  to  the  true  Hessian.  First,  a  reset  should  be  done  if 
one  of  the  tests  in  this  chapter  fail  for  the  secant  Hessian  (e.g.,  if  gl^kSk  -  ^k  8k8k  )• 
Furthermore,  a  reset  should  be  done  if  a  step  computed  with  the  secant  model  yields 
P*W)<rl2  (so  ^at  a  trust  region  decrease  is  called  for)  unless  the  Newton  model 

pred^ip)-- glp  -yP1  Hkp  (6.53) 


yields  \ie?(Pk)>ekl(Pk).  Notice  that  it  is  not  necessary  to  evaluate  H (xk)  to  obtain 
predkipi}).  We  can  instead  determine  an  approximation  to  predkipk)  using  finite 
differences.  A  simple  way  to  do  this  is  to  use 


predkipk)  =  -gkPk  ~  y( Pk)‘Hk(Pk ) 
=  -gkPk-^(Pk)‘yk 


(6.54) 


y£  =  -^(8(h+zPk)-8k) 


where 
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and  e  is  a  small  positive  number  selected  as  in  Section  5.4  of  Dennis  and  Schnabel 
[1983].  A  second  order  finite  difference  equation  which  only  requires  one  extra 
function  evaluation  could  also  be  used. 

A  simpler  alternative  to  this  finite  difference  procedure  is  to  use  the  techniques 
presented  in  Chapter  5  associated  with  AS. 5. 6.  A  test  to  guarantee  the  UPD  condition 
(such  as  :  “Refresh  if  gjcBkgk  >£*  gigk")  is  still  required. 


CHAPTER  7 


Concluding  Remarks 

7.1.  Some  Results  Which  Are  Not  Applicable  To  These  Algorithms 

We  include  the  following  examples  to  show  properties  that  are  not  necessarily  true 
for  the  algorithms  we  have  presented. 

7.1.1.  An  Example  Concerning  Local  Convergence  Rates. 

The  algorithms  of  Chapters  5  through  6  rely  on  the  comparison  of  predicted  function 
reduction  with  actual  function  reduction  to  distinguish  between  models.  It  might  be 
speculated  that  such  methods  have  local  convergence  rates  of  the  same  order  as 
would  be  obtained  by  a  single  model  algorithm  that  selects  the  “best”  model  at  each 
iteration.  However,  consider  the  following  problem.  Let  the  function  f  :  1R  — >  IR  be 
defined  so  that 

/(x)  =  y xlx  ,  g(x)=x  ,  and  H(x)  =  I  .  (7.1) 

Thus/  has  minimizer  x*  =0  with /(;c*)  =  0. 

For  all  k  >  0,  define  two  possible  model  Hessians  by 

Bk  =  I  and  Bk  =  ^  j  (7.2) 
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Notice  that  Bk{  is  exact  and  that  Bk  is  positive  definite  with  ( Bk  )  1  =  y  3 

Suppose  that  we  use  these  two  models  to  implement  an  algorithm  such  as  A.6.1 
and  suppose  that  the  currently  preferred  model  is  always  selected  as  the  initial  trial 
model  for  the  next  iteration  unless  the  alternate  model  does  a  better  job  of  predicting 
the  actual  function  reduction.  Further  suppose  that  the  algorithm  is  started  with 
x  j  =  £  |  ,  Aj  =  2  and  with  B  2  as  the  initial  trial  model.  It  follows  that 

Pi  =  -(^i2)-1#!  =  J}  *  (7-3) 

v. 

pred  2  (pi)  =  ~g\p  \  ~  ~(p  i)‘Bi(pi)  =  ^-  ,  (7.4) 

and 


arcd1(p12)  =  y  ,  (7.5) 

so  that  P\(P\)=  1  •  Thus  the  trial  step  is  accepted,  and  the  next  iterate  is 


With  no  reason  to  change  models  the  algorithm  retains  model  two  as  the  initial 
trial  model  in  the  next  iteration.  Furthermore,  no  change  in  trust  radius  is  required. 
It  follows  that  the  next  iteration  generates 


pi  (pi)  =  1  . 


(7.6) 


As  with  the  previous  iteration,  the  trial  step  is  acceptable,  so  that  the  third  iterate  is 
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*3  = 


2 

J_ 

2 

V.  y 


(7.7) 


Again,  no  trust  radius  reduction  or  model  switch  is  required  before  the  next  iteration. 
Clearly,  the  iterates  generated  by  this  algorithm  satisfy  xk  =y**_2  for  k>  2,  and 

V2 

for  every  k  >  1  we  have  II  jc*+1  -x*  II  =  —  11**  -x*  II .  Despite  this,  we  have 
p  2  (p2)  _  i  ancj  thus  aredk(pf)=predk(pk)  at  every  iteration. 


We  see  that  tests  based  on  comparing  actual  reduction  to  predicted  reduction  are 
capable  of  retaining  a  model  that  yields  slow  linear  convergence  even  though  an 
alternate  model  would  converge  in  one  iteration.  Furthermore,  this  example  does  not 
depend  on  any  particular  trust  region  or  line  search  method  since  each  step  is  a  full 
quasi-Newton  step  with  the  trust  region  constraint  inactive.  The  example  can  thus  be 
applied  to  most  formulations  of  each  of  the  basic  types  of  algorithms  discussed  in 
Chapter  3  since  most  are  essentially  equivalent  whenever  the  trust  region  constraint  is 
not  binding. 


7.1.2.  An  Example  Concerning  Convergence  to  a  Saddle  Point. 


The  single  model  trust  region  algorithms  of  SSB  [1985]  and  More,  Sorensen  [1982, 
1983]  make  use  of  negative  curvature  in  solving  the  trust  region  subproblem  (TRS)  in 
such  a  way  that  convergence  to  saddle  points  is  inhibited.  Specifically,  these 
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algorithms  have  the  property  of  second  order  stationary  point  convergence1  when 
implemented  with  the  exact  Hessian. 

Suppose  that  a  type  two  or  type  three  algorithm  using  two  quadratic  models  is 
implemented  with  the  exact  Hessian  as  one  of  the  model  Hessians.  It  might  be 
speculated  that  such  an  algorithm  also  shares  the  property  of  second  order  stationary 
point  convergence. 

■^1 

However,  consider  the  following  problem  for  x  =  •  We  define  the  function 

A 

/  :  IR3  — > IR  as 

/(0  =  y[(4,)2+fe)2-fe)2]  P-8) 

so  that  the  gradient  and  Hessian  are  expressed  by 

'5ll  i  0  o’ 

g(x)=  ^  ,  H(x)=  0  10.  (7.9) 

-^J  [  0  0-1. 

This  function  is  indefinite  with  a  saddle  point  at  x*  =0. 

Define  two  possible  model  Hessians  by 

10  0 

B^-H  (x)  =  0  1  0  (7.10) 

0  0-1. 


1  An  algorithm  is  second  order  stationary  point  convergent  if  (a)  it  generates  iterates  {xk  (  that  are 
FOSPC,  and  (b)  the  matrix  H(x’)  is  positive  semidefinite  at  any  point  **  for  which  xk  -»** . 


and 
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3  10 

Bk2  =  110.  (7.11) 

0  0  1. 

Suppose  that  the  algorithm  described  in  Section  7.1.1  (which  retains  the 
currently  preferred  model  unless  the  alternate  is  significantly  better)  is  applied  to  this 

f  "V 

1 

problem.  It  is  easily  verified  that  if  the  algorithm  is  started  with  x  j  =  1  and  model 

two  as  the  initial  trial  model,  then  the  third  component  of  xk  is  identically  zero  for  all 
k>  1.  Furthermore,  p*(p*2)=l  for  all  k>\  as  in  the  last  example.  Consequently, 
model  one  (the  exact  Hessian)  is  never  selected  as  a  trial  model  and  the  iterates 
converge  to  the  saddle  point  x  =0. 

7.2.  Work  in  Progress 

We  conclude  this  thesis  with  a  brief  discussion  of  two  of  the  more  promising  areas 
currently  being  investigated. 

7.2.1.  Secant  Methods  and  Directional  Boundedness 

In  Chapter  6,  we  showed  that  quadratic  models  that  are  uniformly  directionally 
bounded  will  satisfy  the  UPD  condition.  We  speculate  that  UPD  need  not  be  directly 
enforced  for  secant  methods.  Currently,  attempts  are  being  made  both  to  (a)  relax  the 
requirement  of  UPD  in  our  theory  in  such  a  way  as  to  admit  secant  updates,  and  (b) 
show  that  existing  secant  updates  are  uniformly  directionally  bounded.  Several 
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encouraging  results2  have  been  established,  but  at  this  time  the  question  remains 
open. 

7.2.2.  Achieving  Nearly  Optimal  Local  Convergence  Rates 

If  the  convergence  rate  associated  with  each  model  is  known  in  advance,  achieving 
the  optimal  rate  for  a  multi-model  algorithm  is  simply  a  matter  of,  at  each  iteration, 
initially  selecting  the  model  associated  with  the  highest  rate.  For  example,  most  of 
the  algorithms  of  Al-Baali  and  Fletcher  [1985]  are  biased  toward  selecting  the  secant 
approximation.  This  is  consistent  with  the  assumption  that  the  secant  model  is 
associated  with  faster  convergence  than  the  Gauss-Newton  model. 

Unfortunately,  one  of  the  greatest  potential  uses  of  a  multi-model  algorithm  is  to 
efficiently  solve  problems  of  many  different  categories  with  one  algorithm.  Consider 
the  NLS  problem.  As  previously  mentioned,  a  secant  model  can  be  associated  with 
either  faster  or  slower  convergence  rates  than  the  Gauss-Newton  model  depending  on 
whether  the  problem  is  zero  residual,  small  residual,  or  large  residual.  Other  examples 
are  easily  found  where  a  priori  knowledge  is  not  available  about  “which  model  is 
right.”  The  lack  of  a  nonpreferential  multi-model  algorithm  that  guarantees  fast  local 
convergence  is  thus  unacceptable. 

The  most  promising  approach  for  designing  such  a  method  seems  to  be  as 
follows.  Consider  an  algorithm  such  as  A. 3.1  .  We  refer  to  an  iteration  that 

2  For  example,  it  can  be  shown  that  if  a  subsequence  of  the  models  satisfy  the  UPD  condition, 
versions  of  type  one  and  type  two  algorithms  exist  that  are  WFOSPC. 
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computes  a  trial  step  for  every  model  as  a  greedy  step.  Such  an  iteration  is  usually3 
capable  of  selecting  between  trial  steps  computed  using  an  exact  Hessian  and  trial 
steps  computed  using  a  constant  Hessian  approximation.  The  ability  to  make  this 
distinction  is  crucial  to  achieving  fast  local  convergence  rates  and  to  extending  the 
More  and  Sorensen  [1982,  1983]  second  order  stationary  point  convergence  results  to 
multi-model  algorithms.  Algorithm  A.3.1  performs  a  greedy  step  at  each  iteration 
and  selects  the  model  yielding  the  largest  function  reduction.  Although  this  algorithm 
is  too  expensive  to  be  useful  for  general  problems,  a  similar  method  that  only 
requires  a  greedy  step  at  infrequent  intervals  should  not  be  prohibitively  expensive. 
Current  research  is  centered  on  finding  inexpensive  algorithms  that  are  seldom 
required  to  execute  a  greedy  step  yet  are  assured  of  converging  with  a  nearly  optimal 
rate. 


3  For  the  cases  in  which  the  values  of  aredk  at  different  trial  steps  are  “indistinguishable,”  it  can 
be  shown  that  any  of  the  trial  steps  can  be  selected  without  significantly  effecting  the  convergence 
rates. 
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APPENDIX  A 


GLOSSARY 


The  following  is  a  brief  glossary  of  terminology  used  in  this  thesis.  Selected  section 
references  are  also  presented. 


I.  Abbreviations  and  Acronyms 


BFGS 

DFP 

DGW  [1981] 
FCD  step 


FOSPC 

NLS 

NL2SOL 


OLC  step 
SSB  [1985] 
TRS 
UPD 

WFOSPC 


The  most  widely  used  secant  method.  See  Section  1.1.1. 

Another  widely  used  secant  method.  See  Section  1.1.1. 

Dennis,  Gay,  Welsch  [1981  a,b].  See  Section  1.1.1. 

A  step  for  which  the  model  predicts  a  decrease  of  at  least  a 
fraction  of  the  decrease  predicted  for  the  Cauchy  point.  See 
Section  4.2.1. 

First  order  stationary  point  convergence.  See  Section  2.2.2. 

The  nonlinear  least  squares  problem.  See  Section  1.1.1. 

The  model  switching  algorithm  presented  in  DGW  [1981]  for  the 
solution  of  NLS  problems.  See  Section  1.1.1. 

An  optimal  locally  constrained  step.  See  Sections  2.3.1,  4.2.1. 
Schultz,  Schnabel,  Byrd  [1985].  See  Section  2.4. 

The  trust  region  subproblem.  See  Section  2.3. 

Uniformly  predicted  decrease.  See  Section  5.2. 

Weak  first  order  stationary  point  convergence.  See  Section  2.2.2. 
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2.  Basic  Notation 

See  also  :  Special  symbols  and  parameters,  Section  3  of  this  Appendix. 


aredk  (p ) 
A  (i  ,A* ) 
A(  A*) 


Bk,Bi&  IR"*" 


A, 

A* 

ek(p)>4(p) 

f  :  r  -»  R1 
g  :  nr  — >  R" 

8t  6  Rn 

H  :  R"  -»  R"x" 
Hk  e  IRnX" 
i 

h 


h 


J  :  Rn  ->  R'1*"1 


The  actual  function  reduction  f(xk)-f(xk+p).  See  equations 
(2.1c),  (4.10). 

The  set  of  “acceptable”  steps  at  iteration  k  for  model  i  with 
trust  radius  A*.  See  Section  4.2.1. 

The  set  of  “acceptable”  steps  at  iteration  k  with  trust  radius  Ak, 
where  the  model  is  understood  from  context.  See  Section  4.2.1. 

An  approximation  to  H(xk).  Superscripts  are  used  to  distinguish 
between  different  model  Hessians  at  the  same  iteration  k.  See 
equations  (2.2b),  (4.1). 

The  trust  radius  at  iteration  k.  See  equation  (TRS),  Section  2.3. 
The  trial  trust  radius  at  iteration  k .  See  Algorithm  A.4.3. 

The  observed  error  for  a  step  p  :  ek(p)  =  I  aredk(p)  -  predk(p)  I 
Superscripts  are  used  to  distinguish  between  different  models  at 
the  same  iteration  k.  See  equations  (3.4),  (5.2). 

The  function  to  be  optimized.  See  equation  (1.1). 

The  gradient  of  / . 

Denotes  g  ( xk ). 

The  Hessian  of  /. 

Denotes  H  ( xk ). 

The  initial  trial  model  at  a  given  iteration.  See  Algorithm  A.4.3. 

A  set  of  indices  of  models  under  consideration  at  iteration  k. 
See  equation  (5.3). 

A  set  of  indices  of  models  at  iteration  k.  Typically,  /*  is  the  set 
of  models  for  which  steps  have  been  computed.  See  equations 
(5.4),  (5.5). 

A  set  of  indices  of  models  at  iteration  k.  Typically,  /*  is  the  set 
of  models  for  which  trial  steps  have  not  been  computed.  See 
equations  (5.4),  (5.5). 

The  Jacobian  of  r. 


Jk  e  R"xm 
L(f  X  i) 
L(f  X  i) 

L(f  x\) 


Denotes  J  (xk ). 

The  level  set  of  /  at  xu  See  Definition  2.6. 

The  convex  hull  of  the  level  set  of  /  at  x{.  See  Definition  2.6. 

An  open  convex  set  containing  L(f  jc{).  See  Section  4.2.2,  or 
Section  4  of  this  Appendix. 
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Nk 

4» 

P  <Pk  >Pk  e 
p  s  A  (i  ,Ak ) 
p  &  A(Ak) 

4 >*(*) 

predk(p)  ,pred‘k(p) 


Pk(p),Pk(p) 


r  :  R"  ->  Rm 

rk  6  IRm 
S  :  R"  — >  RnX" 


x 


.r  e  Rn 
x*  €  R" 

x*  e  R" 

yk 


The  number  of  models  in  the  index  set  /*.  See  equations  (4.1), 
(5.3). 

The  intersection  between  L(f  pc  t)  and  a  certain  ball  about  xm.  See 
equation  (4.19). 

A  step  away  from  xk.  The  superscript  denotes  that  pi  is 
associated  with  the  model  having  index  i . 

An  “acceptable”  step  at  iteration  k  for  model  i  with  trust  radius 
A*.  See  Section  4.2.1. 

An  “acceptable”  step  at  iteration  k  with  trust  radius  A*,  where 
the  model  is  understood  from  context.  See  Section  4.2.1. 

A  model  of  /  with  index  i  at  iteration  k.  See  equations  (2.1a), 
(2.2a). 

Predicted  function  reduction  :  predk (p )  =  f  (xk )  - <j>k (xk  +p ).  See 
equation  (2.1b).  For  standard  quadratics,  this  becomes  : 

predl(p)  =  -gip  -  j-  p‘Bip.  See  equations  (2.1b),  (4.1). 

Superscripts  are  used  to  distinguish  between  different  models  at 
the  same  iteration  k. 

Ratio  of  actual  reduction  to  predicted  reduction  : 

p‘k(p)=  Uredk  --  .  Superscripts  are  used  to  distinguish  between 
predlip) 

different  models  at  the  same  iteration  k.  See  equations  (3.2), 
(4.11). 

The  residual  function  of  the  nonlinear  least  squares  problem. 
See  equation  (1.2). 

Denotes  r(xk). 

The  term  in  the  Hessian  of  the  nonlinear  least  squares  problem 
involving  second  order  derivatives  of  the  components  of  r.  See 
equation  (1.5). 

The  trust  radius  increase  or  decrease  factor.  At  the  end  of  an 
iteration,  A*+1  is  set  to  either  xAt  or  x  II  pk  II  depending  on  the 
algorithm  used.  See  Algorithms  AS.4.6,  AS.5.5. 

A  vector  in  Rn . 

Depending  on  context,  a  local  minimizer  of  /  or  a  first  order 
stationary  point. 

The  iterate  at  step  k. 

An  overestimate  for  the  maximum  curvature  of  the  function  /. 
See  equations  (6.1),  (6.5),  (6.46),  (6.47)  and  Procedure  A.6.2. 

A  vector  used  with  secant  methods.  Typically,  yk=gk+i -gk-  See 
equations  (6.3),  (6.7)  and  (6.9). 
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C* 


A  sequence  of  numbers  associated  with  the  inexact  quasi-Newton 
condition.  They  satisfy  0<£*<1  and  lim  =0.  See  equation 

k  — 

(4.6b). 


3.  Special  Symbols  and  Parameters1 


3.1.  Parameters  Relating  to  Trust  Regions  2 

t|i  Minimum  acceptable  ratio  of  predicted  reduction  to  actual  reduction. 

t|2  If  the  ratio  of  predicted  reduction  to  actual  reduction  exceeds  t|2,  no  trust 

radius  decrease  is  allowed. 

r|3  Ratio  of  predicted  reduction  to  actual  reduction  used  to  trigger  an 

increase  in  trust  radius. 

Yo,Yi  Limiters  for  the  trust  radius  reduction  factor  x.  A  typical  algorithm  will 

select  t£  [ Yo . Yi  1  when  a  trust  radius  reduction  is  called  for.  However, 
most  of  our  algorithms  allow  xe  (0 ,Yi  1- 

Y2,Y3  Limiters  for  the  trust  radius  increase  factor  t.  A  typical  algorithm  will 

select  xe  [Y2.Y3]  when  a  trust  radius  increase  is  called  for. 

a.  This  constant  determines  the  maximum  step  length  as  a  function  of  the 

trust  radius.  All  trust  region  algorithms  enforce  II  pk  II  <(1  +  cti)A*. 


1  A  parameter  is  a  number  used  to  define  or  limit  an  algorithm  or  procedure.  For  example,  if  an 
iterative  procedure  is  considered  to  have  failed  after  performing  £ma,  iterations  without  convergence, 
the  number  kmax  is  said  to  be  a  parameter.  The  precise  value  of  a  parameter  is  typically  defined  only 
when  an  algorithm  is  implemented. 

2  The  descriptions  given  here  are  very  general  and  are  intended  only  to  aid  readers  unfamiliar 
with  trust  region  methods.  The  precise  use  of  these  parameters  sometimes  changes  for  different  algo¬ 
rithms,  so  the  specifications  and  descriptions  in  the  text  of  this  thesis  should  be  considered  definitive. 
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TABLE  2 


PARAMETERS  FOR  TRUST  REGION  METHODS 


Symbol 

Typical 

Legal 

Values 

Sample 

Section 

References 

hi 

0.001 

(0,1) 

2.4,  4.5.1,  4.5.2,  4.5.3 

h2 

0.1 

(hi.i) 

2.4,  4.5.2,  4.5.3 

h3 

0.5 

Ol2.1) 

2.4,  4.5.1 

Yo 

0.1 

(0,1) 

2.4 

Yi 

0.5 

[Yo.  1 ) 

2.4  ,4.5.3 

Y2 

2 

[l.~) 

2.4  ,4.8 

Y3 

4 

[  Y2 . 00  ) 

2.4  ,4.6 

Cfl 

0.1 

(0.1) 

2.3.1,  4.2.1 

3.2.  Parameters  Relating  to  Model  Switching 

H  A  parameter  used  to  control  the  relative  frequency  of  model 

switching.  NL2SOL  uses  p=1.5.  See  Sections  5.5,  5.7  and 
Algorithm  AS.5.3. 

r|1  ,n2  These  parameters  are  primarily  “trust  region  parameters,”  but 

can  also  play  a  role  in  model  switching.  Typically,  if  the  ratio  of 
actual  reduction  to  predicted  reduction  is  less  than  t|2  the 
algorithm  will  consider  switching  models.  If  the  ratio  is  less 
than  rji,  the  algorithm  must  either  switch  models  or  reduce  the 
trust  radius.  See  Section  5.5,  Algorithm  AS.5.3. 
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3.3.  Constants3  Associated  with  the  Curvature  of  the  Problem 

(3,  Under  the  standard  assumptions,  we  assume  an  upper  bound  on  ll//(jt)ll,  say 
1 1  //  Or )  1 1  <P!.  See  Section  4.2.2. 

p2  At  every  iteration,  one  model  is  typically  assumed  to  be  a  standard  quadratic 
with  II B*  II  <p2-  See  Section  4.2.1. 

p3  The  constant  typically  used  with  the  UPD  condition.  See  Sections  5.2  and  6.3. 

p4  A  constant  typically  used  with  the  UPD  condition  and  secant  methods.  See 

Section  6.3. 

e4  A  constant  typically  associated  with  p4.  See  Sections  6.3.2  and  6.3.4. 


3.4.  Parameters  Relating  to  the  Solution  of  the  Trust  Region  Subproblem 

c'!  Parameter  associated  with  an  FCD  step.  See  equations  (2.14),  (4.2). 

c 2  Parameter  associated  with  an  OLC  step.  See  equations  (2.12),  (4.3). 

c  3  Parameter  associated  with  a  UPD  step.  See  equations  (4.4),  (5.1). 

c o  Parameter  used  to  enforce  the  UPD  condition.  See  equation  (6.1). 


3.5.  Miscellaneous  Parameters 

jc  Integer  parameter  used  in  A. 5. 4  and  A. 6.1  to  assure  finite  termination  of  the 
internal  reduction  loop.  See  Section  5.5.3. 

80  Parameter  used  in  AS. 5.6. 

e0  Parameter  used  in  UPD_Test_OLC  for  enforcing  the  UPD  condition.  See  equation 
(6.47). 


3  A  number  is  said  to  be  a  constant  if  it  depends  only  on  a  specific  set  of  parameters  and  a 
specific  problem. 
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4.  Standard  Assumptions  on  the  Function 

For  given  x1eRn  and  /  :  Rn  ->R‘,  the  following  conditions  are  referred  to  as  the 
standard  assumptions  on  the  function  : 

(SA1)  f(x)  is  twice  continuously  differentiable  on  an  open  convex  set  containing 
L(f  jcy).  This  set  is  denoted  L(f  jcJ. 

(SA2)  f(x)  is  bounded  below  on  L(f 

(SA3)  3  Pi >0  such  that  for  all  x  e  L(f  j),  the  Hessian  of  /  satisfies  II  H(x)  II  < px . 


APPENDIX  B 


Of  Things  Not  Treated 


1.  Further  Useful  Concepts 

We  now  present  several  concepts  of  small  importance  to  the  theory  of  multi-model 
algorithms,  but  which  are  of  use  in  actually  writing  an  implementation. 

A  distinction  can  be  made  between  algorithms  which  treat  all  models  more  or 
less  equally,  and  those  which  know  from  the  start  which  one  is  “best”  in  some 
context.  For  example,  consider  an  algorithm  using  a  constant  matrix  for  the  first 
model,  and  a  sequence  of  matrices  for  the  second.  Suppose  { 1 1  fi*2  i  i  }  is  not  known  to 
be  bounded.  The  algorithm  can  be  designed  to  treat  them  equally,  or  to  “fall  back 
on”  the  constant  model  under  certain  conditions.  Logic  which  distinguishes  between 
models  based  on  a  priori  information  is  called  preferential  logic  and  logic  which 
makes  no  such  distinctions  nonpreferential  logic. 

The  program  NL2SOL  described  in  DGW  [1981]  uses  nonpreferential  logic  for 
several  reasons,  among  them  the  fact  that  it  is  generally  not  known  in  advance 
whether  a  NLS  problem  is  small  residual  or  large.  Al-Baali,  Fletcher  [1985]  present 
algorithms  which  treat  the  Gauss-Newton  and  secant  models  in  a  fundamentally 
different  way.  For  example,  some  of  their  algorithms  will  permanently  switch  to  the 
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use  of  the  secant  model  after  a  given  number  of  iterations. 

Another  way  to  initially  select  a  model  is  to  use  past  performance  as  a  guide. 
Hereditary  switching  is  a  switching  system  which  gives  preference  to  different 
models  based  on  the  results  of  tests  that  occurred  in  the  previous  iteration. 
Nonhereditary  switching  is  a  switching  system  which  has  no  memory  of 
performance  of  each  model  at  the  last  iteration. 

In  NL2SOL  the  model  used  to  test  the  step  in  the  most  recent  iteration  is  tried 
first  and  given  preference  unless  the  final  e  -  test  in  the  most  recent  iteration  rated  the 
alternate  as  significantly  better.  NL2SOL  uses  this  hereditary  system  with  the 
justification  that  if  a  model  is  performing  satisfactorily,  one  should  stick  to  it  unless 
there  is  good  reason  not  to,  particularly  since  such  things  as  previous  trust  region 
updates  have  been  determined  via  this  model.  Original  test  results  as  reported  in 
DGW  [1981]  support  this  philosophy. 

The  following  scheme  for  classifying  models  should  clarify  the  different  ways 
model  generation  is  treated.  The  models  are  organized  by  how  they  relate  to  the 
multi-model  algorithm  which  generates  them. 

Independent  model  generation  is  a  system  where  models  are  generated  or 
updated  totally  independently.  In  nonlinear  least  squares,  such  a  system  might  take 
Jk  'Jk  as  one  model  and  a  BFGS  secant  approximation  as  an  alternate. 

Semi-independent  model  generation  is  a  system  where  models  may  depend  on 
others  computed  at  this  iteration,  but  not  on  any  of  the  switching  logic  used  in  the 
last  iteration.  Some  examples  are  as  follows. 
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(a)  In  NLS  problems,  such  a  system  might  take  Bk  =  Jk  'Jk  as  one  model  with 
alternate  Bk  =  Jk  ‘Jk  +  Sk .  Here  Sk  is  one  of  the  structured  secant  approximations 
to  s  (x). 

(b)  In  NLS  problems,  another  example  of  such  a  system  is  a  BFGS  algorithm  which 
resets  to  Jk  ‘  Jk  after  a  fixed  number  of  iterations. 

Dependent  model  generation  is  a  system  where  models  may  depend  on  others 
computed  at  this  iteration,  and  on  switching  logic  used  in  the  previous  iteration.  This 
allows  such  things  as  “restarts.”  Some  examples  of  this  type  of  system  are  as 
follows. 

(a)  Al-Baali,  Fletcher  [1985]  algorithms  which  reset  Sk  to  zero  whenever  the  Gauss- 
Newton  Hessian  is  selected. 

(b)  The  algorithm  of  Nazareth  [1980,1983]  which  chooses  a  parameter  a  to 
interpolate  between  a  secant  update  and  the  Gauss-Newton  model  based  on 
performance  of  each  model  at  the  last  step. 

(c)  MINPACK  (More,  Garbow,  Hillstrom  [1980])  routines  which  “refresh”  a  secant 
approximation  by  setting  it  to  H(xk)  when  it  appears  to  be  behaving  poorly. 

2.  General  Definition  for  Models 

More  generality  can  be  obtained  by  defining  a  model  in  terms  of  the  processes  which 
can  be  associated  with  it.  Such  a  definition  allows  us  to  conceptually  separate  each 
different  way  a  model  can  be  used. 
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When  referred  to  in  context  as  an  abstract  concept,  a  model  having  indices  i  and 
k  is  something  having  one  or  more  of  the  following  processes  associated  with  it. 

(a)  Evaluating  for  some  given  x  the  single  valued  function  <t»*(x)  '■  with  til  a 

nonempty  subset  of  Rn ,  or 

(b)  Providing  a  trial  step  pi  e  R\  or 

(c)  Setting  up  information  necessary  for  the  execution  of  a  process  associated  with  a 
model  having  indices  j  and  k+ 1. 

Considering  a  model  defined  in  this  fashion  delineates  between  “testing  a  trial 
step”  as  in  (a),  “computing  a  trial  step”  as  in  (b),  and  “updating”  as  in  (c).  Thus  a 
model  could  be  defined  which  only  evaluates  trial  steps  generated  by  other  models 
without  having  the  capability  of  generating  such  trial  steps,  or  vice  versa.  Also,  a 
model  need  not  be  called  upon  to  do  either  at  any  given  step,  as  long  as  it  stores 

enough  information  to  generate  itself  when  needed.1 

Commonly,  any  pi  provided  by  a  model  is  associated  with  a  function  <1 >*(*) 
which  can  be  evaluated  for  various  *,  and  $l(xk+pl)  is  automatically  provided 
whenever  p‘k  is  provided.  However,  we  will  not  rule  out  models  that  can  only  invoke 
processes  to  evaluate  <)>  or  only  to  provide  a  step  pi,  or  only  to  define  a  new  model  for 
the  next  step. 

1  For  example,  with  the  Newton  model  we  need  not  actually  evaluate  the  exact  Hessian  at  each 
iteration  unless  it  is  actually  needed.  A  more  complex  example  is  a  model  which  at  each  iteration 
“remembers”  the  last  m  iterates  and  “updates”  the  identity  matrix  to  produce  a  quadratic  which  in¬ 
terpolates  this  data  as  well  as  possible.  Our  “model”  would  be  the  background  process  of 
“remembering”  roughly  m  vectors  until  needed. 


